JP2021021852A - Voice recognition device, electronic apparatus, control method and control program

Info

Publication number: JP2021021852A
Application number: JP2019138676A
Authority: JP
Inventors: Naoya Suzuki (鈴木 直也)
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2019-07-29
Filing date: 2019-07-29
Publication date: 2021-02-18

Abstract

To provide a voice recognition device, an electronic apparatus, a control method and a control program that improve the recognition precision of voice uttered by a person within collected sound.

SOLUTION: A voice recognition device (control unit 10) in an interactive robot R comprises: a detected sound direction specification unit for specifying the sound generation direction of a sound source from a plurality of detected sounds acquired from a plurality of microphones; a human image detection unit for detecting a human image on the basis of imaging data acquired by imaging the sound generation direction of the sound source and/or a sensor signal acquired by sensing the sound generation direction of the sound source; and a detected sound acquisition validity/invalidity determination unit for determining acquisition of the plurality of detected sounds to be valid or invalid on the basis of information acquired from the human image detection unit.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice recognition device and the like.

In recent years, various voice recognition devices have been developed that collect utterances with a sensor, a microphone, or the like and recognize the collected sound as voice uttered by a person.

As a technique related to such voice recognition devices, techniques for preventing malfunctions caused by sounds other than human speech have been disclosed. For example, Patent Document 1 discloses a sound collecting device that includes three microphones, extracts voice data from a specific direction out of the sound collected by these microphones, and estimates, based on the extracted voice data, that the data is human voice data.

Japanese Unexamined Patent Publication No. 2016-46769 (published on April 4, 2016)

However, with the above-mentioned sound collecting device, when a human voice and an environmental sound, such as the sound of a television program, have the same sound quality and the same volume, it may not be possible to determine which of them is the human voice.

One aspect of the present invention has been made in view of the above problem, and its object is to realize a voice recognition device and the like that improve the recognition accuracy of voice uttered by a person within the collected sound.

To solve the above problem, a voice recognition device according to one aspect of the present invention includes: a detected sound direction specifying unit that specifies the sound generation direction of a sound source from a plurality of detected sounds acquired from a plurality of microphones; a human image detection unit that detects a human image based on imaging data acquired by an imaging unit imaging the sound generation direction of the sound source and/or a sensor signal acquired by a human presence sensor unit sensing the sound generation direction of the sound source; and a detected sound acquisition valid/invalid determination unit that, based on information acquired from the human image detection unit, determines that acquisition of the plurality of detected sounds is valid when the human image can be confirmed, or determines that acquisition of the plurality of detected sounds is invalid when the human image cannot be confirmed.

To solve the above problem, an electronic device according to one aspect of the present invention includes a voice recognition device and a drive unit that drives the imaging unit toward the generation direction of the detected sound. The voice recognition device has: a detected sound direction specifying unit that specifies the sound generation direction of a sound source from a plurality of detected sounds acquired from a plurality of microphones; a human image detection unit that detects a human image based on imaging data acquired by imaging the sound generation direction of the sound source and/or a sensor signal acquired by sensing the sound generation direction of the sound source; and a detected sound acquisition valid/invalid determination unit that, based on information acquired from the human image detection unit, determines that acquisition of the plurality of detected sounds is valid when the human image can be confirmed, or determines that acquisition of the plurality of detected sounds is invalid when the human image cannot be confirmed.

To solve the above problem, a control method for a voice recognition device according to one aspect of the present invention includes: a detected sound direction specifying step of specifying the sound generation direction of a sound source from a plurality of detected sounds acquired from a plurality of microphones; a human image detection step of detecting a human image based on imaging data acquired by imaging the sound generation direction of the sound source and/or a sensor signal acquired by sensing the sound generation direction of the sound source; and a detected sound acquisition valid/invalid determination step of, based on information acquired in the human image detection step, determining that acquisition of the plurality of detected sounds is valid when the human image can be confirmed, or determining that acquisition of the plurality of detected sounds is invalid when the human image cannot be confirmed.

According to one aspect of the present invention, the recognition accuracy of voice uttered by a person within the collected sound can be improved.

FIG. 1 is a block diagram showing the main configuration of the interactive robot according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing the interactive robot.
FIG. 3 is a diagram showing an operation example of the interactive robot.
FIG. 4 is a flowchart showing an example of the processing flow of the interactive robot.
FIG. 5 is a diagram showing an operation example of the interactive robot according to Embodiment 2 of the present invention.
FIG. 6 is a flowchart showing an example of the processing flow of the interactive robot.
FIG. 7 is a block diagram showing the main configuration of the interactive robot according to Embodiment 3 of the present invention.
FIG. 8 is a diagram showing an operation example of the interactive robot.
FIG. 9 is a flowchart showing an example of the processing flow of the interactive robot.
FIG. 10 is a block diagram showing the main configuration of the interactive robot according to Embodiment 4 of the present invention.
FIG. 11 is a diagram showing an operation example of the interactive robot.
FIG. 12 is a flowchart showing an example of the processing flow of the interactive robot.

[Embodiment 1]
Embodiment 1 of the present disclosure will be described with reference to FIGS. 1 to 4. FIG. 1 is a block diagram showing the main configuration of the interactive robot R according to the present embodiment. In the following description, the control unit functions as the voice recognition device, so a separate, duplicate description of the voice recognition device is omitted.

The interactive robot R is an electronic device that distinguishes the utterance of a person, who is the user, from other sounds (for example, the sound of a television program), treats the person's utterance as valid, and treats the other sounds as invalid.

As shown in FIG. 1, the interactive robot R includes a microphone 20, a speaker 30, a control unit 10 that functions as the voice recognition device, a drive unit 40, an imaging unit 50, a human presence sensor unit 60, and a storage unit 70.

The microphone 20 is an input device that detects sound. Any type of microphone may be used, provided that it has sufficient detection accuracy and directivity for the detected sound direction specifying unit 12, described later, to specify the direction of the detected sound. The detection control unit 18, described later, controls when the microphone 20 starts and stops detecting sound. The interactive robot R includes a plurality of microphones 20, and the microphones 20, 20, 20 are desirably arranged facing different directions. This improves the accuracy with which the detected sound direction specifying unit 12 specifies the direction of the detected sound (sound source).

The speaker 30 outputs a message, which is the response content, as voice under the control of the output control unit 19 described later. The interactive robot R may include a plurality of speakers 30. Here, a "response" means the reaction of the interactive robot R to an utterance, expressed by voice, motion, light, or a combination of these.

The control unit 10 is a CPU (Central Processing Unit) that performs overall control of the interactive robot R. As functional blocks, the control unit 10 includes a detected sound acquisition unit 11, a detection control unit 18, and an output control unit 19. The control unit 10 functions as the voice recognition device.

The detected sound acquisition unit 11 acquires the detected sounds from the microphones 20. It acquires the detected sound of each of the plurality of microphones 20 separately, and it divides the detected sound of each microphone 20 into segments of arbitrary length and acquires them over multiple acquisitions. The detected sound acquisition unit 11 includes the detected sound direction specifying unit 12, the human image detection unit 13, and the detected sound acquisition valid/invalid determination unit 14.
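The per-microphone, segment-by-segment acquisition described above can be sketched as follows. This is a minimal illustration only: the function names, the microphone identifiers, and the fixed segment length are assumptions made for the example and do not appear in the publication, which allows segments of arbitrary length.

```python
from typing import Dict, List, Sequence


def split_into_segments(samples: Sequence[float], segment_len: int) -> List[List[float]]:
    """Cut one microphone's samples into consecutive segments.

    The last segment may be shorter than segment_len.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [list(samples[i:i + segment_len])
            for i in range(0, len(samples), segment_len)]


def acquire_per_microphone(mic_samples: Dict[str, Sequence[float]],
                           segment_len: int) -> Dict[str, List[List[float]]]:
    """Keep each microphone's detected sound separate, keyed by a (hypothetical) mic id."""
    return {mic_id: split_into_segments(samples, segment_len)
            for mic_id, samples in mic_samples.items()}
```

Keeping the streams separate per microphone is what allows a later stage to compare them against each other when estimating direction.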

The detected sound direction specifying unit 12 specifies the direction from which the detected sound collected by the microphones 20 has arrived. That is, it estimates the direction of the sound source from the phase differences and volume differences between the detected sounds and specifies the arrival (generation) direction of the detected sound (voice or the like). The detected sound direction specifying unit 12 specifies the sound generation direction of the sound source comprehensively from the plurality of detected sounds acquired from the plurality of microphones 20, 20, 20. Based on the arrival direction information indicating the arrival (generation) direction specified by the detected sound direction specifying unit 12, the control unit 10 drives the drive unit 40 described later.
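The publication does not state how the phase difference is converted into a direction. One common approach, shown here purely as an assumed illustration, is to estimate the time delay between two microphones by cross-correlation and convert it to a bearing under a far-field model. The function names, the speed-of-sound constant, and the two-microphone geometry are assumptions of this sketch.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def best_lag(a: Sequence[float], b: Sequence[float], max_lag: int) -> int:
    """Return the lag d in [-max_lag, max_lag] that maximizes sum(a[n] * b[n + d]).

    A positive result means b is a delayed copy of a, i.e. the sound
    reached microphone a first.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(a)):
            m = n + lag
            if 0 <= m < len(b):
                score += a[n] * b[m]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_from_lag(lag_samples: int, sample_rate: float, mic_spacing: float) -> float:
    """Angle in degrees from the broadside of the microphone pair (far-field assumption)."""
    if sample_rate <= 0 or mic_spacing <= 0:
        raise ValueError("sample_rate and mic_spacing must be positive")
    x = SPEED_OF_SOUND * (lag_samples / sample_rate) / mic_spacing
    x = max(-1.0, min(1.0, x))  # clamp against measurement noise
    return math.degrees(math.asin(x))
```

A single microphone pair cannot tell front from back, which is one reason a third microphone facing another direction, and the volume difference the text mentions, are useful additional cues.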

The human image detection unit 13 detects whether a human image is present, based on imaging data acquired by the imaging unit 50 (described later) imaging the arrival (generation) direction of the detected sound specified by the detected sound direction specifying unit 12, and/or a sensor signal acquired by the human presence sensor unit 60 (described later) sensing that direction.

Based on the information acquired from the human image detection unit 13, the detected sound acquisition valid/invalid determination unit 14 determines that acquisition of the detected sound is valid when the detected sound originates from a person's utterance, and determines that acquisition of the detected sound is invalid when it does not.
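The relationship between human image detection and the valid/invalid determination can be sketched as below. The names are hypothetical, and treating a hit from either source as "person present" when both the camera and the sensor are used is an assumption of this sketch; the publication only says "and/or" and does not specify how two results are combined.

```python
from enum import Enum
from typing import Optional


class Acquisition(Enum):
    VALID = "valid"
    INVALID = "invalid"


def person_detected(camera_hit: Optional[bool], sensor_hit: Optional[bool]) -> bool:
    """Combine camera and sensor results; None means that source was not used.

    Assumption: a hit from either source counts as a person being present.
    """
    return any(hit for hit in (camera_hit, sensor_hit) if hit is not None)


def judge_acquisition(person_present: bool) -> Acquisition:
    """Valid when a human image is confirmed in the sound direction, invalid otherwise."""
    return Acquisition.VALID if person_present else Acquisition.INVALID
```

For example, `judge_acquisition(person_detected(False, None))` models the television case in which only the camera is used and no person is found.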

The imaging unit 50 images the generation direction of the detected sound specified by the detected sound direction specifying unit 12 and sends the acquired imaging data to the human image detection unit 13. For example, the imaging unit 50 includes an image sensor such as a CCD or CMOS sensor and circuitry such as an A/D converter.

The human presence sensor unit 60 senses the generation direction of the detected sound specified by the detected sound direction specifying unit 12 and sends the acquired sensor signal to the human image detection unit 13. For example, the human presence sensor unit 60 includes a temperature sensor capable of sensing the temperature of a person's face or the like. Instead of a temperature sensor, the human presence sensor unit may use an infrared sensor, or a combination of an infrared sensor and an ultrasonic sensor.

The drive unit 40 drives the imaging unit 50 and/or the human presence sensor unit 60 so that it faces the generation direction of the sound of the sound source (detected sound) specified by the detected sound direction specifying unit 12. The drive unit 40 includes a moving unit 41, described later, that moves the interactive robot R.

The storage unit 70 is a memory that stores data required for the processing executed by the control unit 10. The storage unit 70 contains at least a response sentence table. The response sentence table is a data table in which response contents are registered in association with predetermined sentences or keywords. For example, the character string of a message that answers the sentence or keyword is registered as the response content.

When there is a response for a recognized word, the output control unit 19 causes the speaker 30 to output the pre-registered response message.
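A response sentence table of the kind described above, together with the lookup performed before output, might look like the following. The table entries, the case-insensitive substring matching, and the function name are all assumptions of this sketch; the publication only states that response contents are registered against sentences or keywords.

```python
from typing import Dict, Optional

# Hypothetical entries: keyword -> registered response message.
RESPONSE_TABLE: Dict[str, str] = {
    "hello": "Hello. How can I help?",
    "good night": "Good night.",
}


def lookup_response(recognized_text: str,
                    table: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the registered message for the first keyword found in the text.

    Returns None when no keyword matches, in which case nothing is output.
    """
    table = RESPONSE_TABLE if table is None else table
    text = recognized_text.lower()
    for keyword, message in table.items():
        if keyword in text:
            return message
    return None
```

A caller would pass the returned message to the speech output only when the result is not `None`, which mirrors the "when there is a response" condition.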

The detection control unit 18 starts and stops sound detection by the microphones 20.

Next, the specific structure and operation of the interactive robot R will be described with reference to FIGS. 2 and 3. FIG. 2 is a diagram showing the interactive robot R. FIG. 3 is a diagram showing an operation example of the interactive robot R.

As shown in FIG. 2, the interactive robot R has, for example, a humanoid structure with a head R20, a torso R30, two arms R40, and two legs R50. On the forehead R20b of the head R20, one microphone 20 is arranged above each of the two eyes R21, R21. One more microphone 20 is arranged on the back of the head R20c. The following describes a case in which the interactive robot R invalidates acquisition of the sound (detected sound) of a television program from a television device T placed behind the robot body R10, and then validates acquisition of the voice (detected sound) of a person speaking from in front of the robot body R10.

As shown in FIG. 3(a), when the microphones 20 of the interactive robot R detect the sound of a television program (the dash-dot line in the figure), the detected sound acquisition unit 11 of the control unit 10 acquires the detected sound of the television program. The detected sound direction specifying unit 12 specifies the arrival direction of the detected sound (behind the head, the direction of the arrow in the figure) from the detected sound. In this example, the arrival direction is specified from the volume difference and phase difference of the sound acquired by the microphone 20 on the back of the head R20c (FIG. 3(b)).

Based on the arrival direction information from the detected sound direction specifying unit 12, the control unit 10 operates the drive unit 40 (for example, a drive motor) in the torso R30 of the robot body R10 and points the imaging unit 50 on the crown R20a, or the human presence sensor unit 60 between the eyebrows, in the arrival direction of the detected sound. That is, the control unit 10 drives the drive unit 40 to rotate the linked head R20 by 180 degrees, so that the imaging unit 50 on the crown R20a or the human presence sensor unit 60 between the eyebrows faces the arrival direction of the detected sound (FIG. 3(c)). The imaging unit 50 images the television screen (detection target) of the television device T located in the arrival direction and sends the acquired imaging data to the human image detection unit 13. Alternatively, the human presence sensor unit 60 senses the television screen (detection target) of the television device T located in the arrival direction and sends the acquired sensor signal to the human image detection unit 13.

In the above description, either the imaging unit 50 or the human presence sensor unit 60 is operated to acquire imaging data or a sensor signal, but both may be operated so that both the imaging data and the sensor signal are acquired and sent to the human image detection unit 13. As shown in FIG. 3(c), the front of the head R20 faces the television device T at this point. After the imaging unit 50 has captured the image or the human presence sensor unit 60 has performed sensing, the drive unit 40 rotates the head R20 by 180° to return it to its original position (FIG. 3(d)).
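The turn-observe-return motion of the head can be expressed as a pair of rotations. The sketch below assumes headings in degrees with 0 meaning straight ahead, and picks the shortest signed rotation; neither the angle convention nor the choice of rotation direction is given in the publication, and the function names are hypothetical.

```python
from typing import Tuple


def rotation_to_face(current_deg: float, target_deg: float) -> float:
    """Smallest signed rotation, in (-180, 180], that turns current heading to target."""
    diff = (target_deg - current_deg) % 360.0
    return diff - 360.0 if diff > 180.0 else diff


def look_and_return(current_deg: float, target_deg: float) -> Tuple[float, float]:
    """Rotation that points the head at the sound, and the rotation that undoes it."""
    out = rotation_to_face(current_deg, target_deg)
    return out, -out
```

For the television example, a head facing forward (0) and a sound from directly behind (180) gives a 180-degree turn followed by a 180-degree turn back.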

On acquiring the imaging data from the imaging unit 50 or the sensor signal from the human presence sensor unit 60, the human image detection unit 13 detects whether a human image is present in that imaging data or sensor signal. In this example, the human image detection unit 13 outputs "human image not detected" from the imaging data containing the television image of the television device T, or outputs "human image not detected" from the sensor signal. The human image detection unit 13 sends this non-detection information to the detected sound acquisition valid/invalid determination unit 14. When a human image is detected from imaging data, a program that recognizes people (a so-called person recognition engine) is used.

Based on the non-detection information from the human image detection unit 13, the detected sound acquisition valid/invalid determination unit 14 determines that acquisition of the detected sound (the sound of the television screen) from the arrival direction is invalid (FIG. 3(d)). The area shown by the dotted line in FIG. 3(d) is the voice invalidation area.

Next, as shown in FIG. 3(e), when the microphones 20 of the interactive robot R detect the sound of an utterance by a person P from the front (the dash-dot line in the figure), the detected sound acquisition unit 11 of the control unit 10 acquires the detected sound originating from the person's utterance. The detected sound direction specifying unit 12 specifies the arrival direction of the detected sound (in front of the head).

Based on the arrival direction information from the detected sound direction specifying unit 12, the control unit 10 points the imaging unit 50 or the human presence sensor unit 60 in the arrival direction of the detected sound. The imaging unit 50 images the person P (detection target) located in the arrival direction and sends the imaging data to the human image detection unit 13. Alternatively, the human presence sensor unit 60 senses the person P (detection target) located in the arrival direction and sends the sensor signal to the human image detection unit 13.

The human image detection unit 13 detects the presence of a human image from the imaging data, or from the sensor signal, and sends human detection information to the detected sound acquisition valid/invalid determination unit 14.

Based on the human detection information, the detected sound acquisition valid/invalid determination unit 14 determines that acquisition of the detected sound (the person's voice) from the arrival direction is valid. The detected sound acquisition unit 11 continues to acquire the voice data.

The processing flow of the interactive robot R will be described with reference to FIG. 4. FIG. 4 is a flowchart showing an example of the processing flow of the interactive robot R.

The robot waits until the plurality of microphones 20, 20, 20 detect a sound (S1). When a sound arrives (S2, FIG. 3(a)), the microphones 20, 20, 20 detect it and the detected sound acquisition unit 11 acquires each detected sound separately. If the detected sound direction specifying unit 12 can specify the arrival direction from the detected sounds and obtain arrival direction information (S3: YES, FIG. 3(b)), the drive unit 40 is operated based on that information to point the imaging unit 50 on the crown R20a of the head R20, or the human presence sensor unit 60 between the eyebrows, in the arrival direction (FIG. 3(c)). The imaging unit 50 images the arrival direction, or the human presence sensor unit 60 senses it (S4). The imaging data or sensor signal is sent to the human image detection unit 13.

If the human image detection unit 13 detects the presence of a human image from the imaging data or sensor signal (S5: YES), the detected sound acquisition valid/invalid determination unit 14 determines, based on the human image detection information from the human image detection unit 13, that acquisition of the detected sound (the person's voice) from the arrival direction is valid. The control unit 10 (detected sound acquisition unit 11) acquires the voice data, which is the human-derived detected sound (S6, FIG. 3(e)), and processing returns to sound arrival (S2).

If the human image detection unit 13 cannot detect the presence of a human image from the imaging data or sensor signal (S5: NO), the detected sound acquisition valid/invalid determination unit 14 determines, based on the non-detection information from the human image detection unit 13, that acquisition of the detected sound from the arrival direction is invalid. The control unit 10 (detected sound acquisition unit 11) does not acquire the voice data (S7, FIG. 3(d)), and processing returns to sound arrival (S2).

If the detected sound direction specifying unit 12 cannot specify the arrival direction from the detected sound and therefore cannot obtain arrival direction information (S3: NO), processing returns to sound arrival (S2).
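The branching of steps S2 to S7 can be summarized in a short loop. This is only a sketch of the control flow: `locate` and `person_in_direction` are hypothetical callbacks standing in for the direction specifying unit and for the combined drive, imaging or sensing, and human image detection stages, and a finite list of sounds replaces the robot's endless wait in S1.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Sound = TypeVar("Sound")


def process_sounds(sounds: Iterable[Sound],
                   locate: Callable[[Sound], Optional[float]],
                   person_in_direction: Callable[[float], bool]) -> List[Sound]:
    """Return only the sounds whose acquisition is judged valid."""
    accepted: List[Sound] = []
    for sound in sounds:                     # S2: a sound arrives
        direction = locate(sound)            # S3: try to specify the arrival direction
        if direction is None:                # S3: NO, go back to waiting for sound
            continue
        if person_in_direction(direction):   # S4 and S5: image or sense, then detect a person
            accepted.append(sound)           # S6: acquisition valid, keep the voice data
        # S5: NO corresponds to S7: acquisition invalid, the sound is not kept
    return accepted
```

With a television behind the robot and a speaker in front, only the sound from the direction in which a person is found ends up in the returned list.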

With the above processing, the interactive robot R can invalidate sound collection from areas where the sound is clearly not a person's utterance, creating an environment in which voice recognition is easier. Specifically, the detected sound direction specifying unit 12 specifies the arrival direction from the detected sound, and the drive unit 40 is operated based on the arrival direction information, so that the imaging unit 50 on the crown R20a of the head R20, or the human presence sensor unit 60 between the eyebrows, can be pointed accurately in the arrival direction. The imaging unit 50 images the arrival direction and sends the imaging data to the human image detection unit 13. Alternatively, the human presence sensor unit 60 senses the arrival direction and sends the sensor signal to the human image detection unit 13.
When the human image detection unit 13 detects the presence of a human image from the imaging data or sensor signal, it sends the human image detection information to the detected sound acquisition valid/invalid determination unit 14. The determination unit 14 then determines, based on that information, that acquisition of the detected sound (the person's voice) from the arrival direction is valid, so that the control unit 10 (detected sound acquisition unit 11) reliably acquires only voice data that is a human-derived detected sound. When the human image detection unit 13 cannot detect the presence of a human image, it sends the non-detection information to the determination unit 14. The determination unit 14 then determines that acquisition of the detected sound from the arrival direction is invalid, so that the control unit 10 (detected sound acquisition unit 11) no longer acquires detected sounds that do not come from a person.

このような処理を行うことにより各マイク２０から集音した検出音が人の発話した音声であるか或いは人以外の音であるかを確実に特定した上で、人の音声の取得を有効にできる一方、明らかに人の発話以外の音の取得を無効にできるので、音声認識の障害となり得る外部ノイズを抑制し、人が発話した音声の認識精度を向上することができる。 By performing such processing, the robot can reliably identify whether the detected sound collected from each microphone 20 is a voice uttered by a person or a sound from something other than a person, and can then validate acquisition of human voice while invalidating acquisition of sounds that are clearly not human speech. This suppresses external noise that could interfere with voice recognition and improves the recognition accuracy of voice uttered by a person.

［実施形態２］ [Embodiment 2]
本開示の実施形態２について、図５〜６を用いて説明する。なお、説明の便宜上、実施形態２のブロック図、対話ロボットの構造は、実施形態１のブロック図、対話ロボットの構造と同じであるため重複する説明を省略する。実施形態１で説明した部材と同じ機能を有する部材については、同じ符号を付記し、その説明を省略する。 Embodiment 2 of the present disclosure will be described with reference to FIGS. 5 and 6. For convenience of explanation, the block diagram of the second embodiment and the structure of the dialogue robot are the same as the block diagram and the structure of the dialogue robot of the first embodiment, so duplicate description will be omitted. The same reference numerals are added to the members having the same functions as the members described in the first embodiment, and the description thereof will be omitted.

次に、対話ロボットの具体的な動作について、図５を用いて説明する。図５は、対話ロボットの動作例を示す図である。対話ロボットＲの頭部Ｒ２０にはその両眼部Ｒ２１，Ｒ２１の上方向にそれぞれ１つずつマイク２０が配置される（図２参照）。さらに、後頭部Ｒ２０ｃには１つのマイク２０が配置されている。図５では一例として、対話ロボットＲでは、ロボット本体Ｒ１０の後方に配置したテレビ装置Ｔのテレビ番組の音の取得とロボット本体Ｒ１０の側方に配置したラジオ装置Ｑのラジオ番組の音の取得を無効とした後、ロボット本体Ｒ１０の前方から発話された人の音声の取得を有効とする場合について説明する。 Next, the specific operation of the dialogue robot will be described with reference to FIG. 5. FIG. 5 is a diagram showing an operation example of the dialogue robot. One microphone 20 is arranged above each of the two eye portions R21, R21 on the head R20 of the dialogue robot R (see FIG. 2). In addition, one microphone 20 is arranged on the back of the head R20c. As an example, FIG. 5 illustrates a case in which the dialogue robot R first invalidates acquisition of the sound of a TV program from the television device T placed behind the robot body R10 and of the sound of a radio program from the radio device Q placed to the side of the robot body R10, and then validates acquisition of the voice of a person speaking from in front of the robot body R10.

図５（ａ）に示すように、対話ロボットＲの各マイク２０がテレビ装置Ｔのテレビ番組の音（図中に示す一点鎖線）およびラジオ装置Ｑのラジオ番組の音（図中に示す一点鎖線）を検出すると、制御部１０の検出音取得部１１は、テレビ番組の検出音およびラジオ番組の検出音を取得する。検出音方向特定部１２は、各検出音から各検出音の到来方向（頭部後方および頭部側方、図中に示す矢印方向）をそれぞれ特定する。各検出音の到来方向は、各マイク２０で取得した音の音量差や位相差から特定する（図５（ｂ））。 As shown in FIG. 5(a), when the microphones 20 of the dialogue robot R detect the sound of the TV program from the television device T (dash-dot line in the figure) and the sound of the radio program from the radio device Q (dash-dot line in the figure), the detection sound acquisition unit 11 of the control unit 10 acquires the detected sound of the TV program and the detected sound of the radio program. The detection sound direction specifying unit 12 specifies, from each detected sound, its arrival direction (behind the head and to the side of the head; the arrow directions in the figure). The arrival direction of each detected sound is specified from the volume differences and phase differences of the sound acquired by the microphones 20 (FIG. 5(b)).
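
The text states only that the arrival direction is specified from volume and phase differences between the microphones 20; it does not give an algorithm. As a minimal sketch of one common approach, assuming two microphones a known distance apart and a distant source, the delay between the two signals can be found by cross-correlation and converted to an angle. The function names, the 343 m/s speed of sound, and the two-microphone geometry are illustrative assumptions and are not taken from the disclosure.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air (assumption)


def best_lag(a, b, max_lag):
    """Return the lag in samples that best aligns signal b with signal a.

    A positive result means the sound reached microphone b later than
    microphone a.
    """
    def score(lag):
        # Cross-correlation of a with b shifted by `lag` samples.
        total = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += x * b[j]
        return total

    return max(range(-max_lag, max_lag + 1), key=score)


def arrival_angle_deg(a, b, sample_rate_hz, mic_spacing_m):
    """Estimate the arrival angle from the delay between two microphones.

    The angle is measured from the direction perpendicular to the line
    joining the microphones; positive angles lean toward microphone a.
    """
    # The delay can never exceed the travel time between the microphones.
    max_lag = math.ceil(mic_spacing_m / SPEED_OF_SOUND * sample_rate_hz)
    delay_s = best_lag(a, b, max_lag) / sample_rate_hz
    ratio = delay_s * SPEED_OF_SOUND / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding overshoot
    return math.degrees(math.asin(ratio))
```

With three microphones, as on the robot head described here, the same pairwise estimate could be repeated for each pair to resolve front from back; that extension is likewise an assumption.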

制御部１０は、検出音方向特定部１２からの各検出音の到来方向の情報に基づいて、胴体部Ｒ３０内の駆動部４０を動作し、頭頂部Ｒ２０ａの撮像部５０および頭部眉間の人感センサー部６０を各検出音の到来方向に向ける。 Based on the arrival direction information of each detected sound from the detection sound direction specifying unit 12, the control unit 10 operates the drive unit 40 in the body portion R30 and directs the imaging unit 50 on the crown R20a and the motion sensor unit 60 between the eyebrows of the head toward the arrival direction of each detected sound.

すなわち、制御部１０は、駆動部４０を駆動して連動する頭部Ｒ２０を角度１８０°回転することにより、頭頂部Ｒ２０ａの撮像部５０および頭部眉間の人感センサー部６０を検出音の到来方向に向ける（図５（ｃ））。 That is, the control unit 10 drives the drive unit 40 to rotate the interlocked head R20 by 180°, thereby directing the imaging unit 50 on the crown R20a and the motion sensor unit 60 between the eyebrows of the head toward the arrival direction of the detected sound (FIG. 5(c)).

撮像部５０は検出音の到来方向に位置するテレビ装置Ｔのテレビ画面（検出対象）を撮像する。撮像部５０は撮像して取得した撮像データを人像検出部１３に送信する。かつ、人感センサー部６０は検出音の到来方向に位置するテレビ装置Ｔのテレビ画面（検出対象）をセンシングする。人感センサー部６０はセンシングして取得したセンサー信号を人像検出部１３に送信する。 The imaging unit 50 images the television screen (detection target) of the television device T located in the direction of arrival of the detection sound. The imaging unit 50 transmits the imaged data acquired by imaging to the human image detection unit 13. In addition, the motion sensor unit 60 senses the television screen (detection target) of the television device T located in the direction of arrival of the detected sound. The motion sensor unit 60 transmits the sensor signal obtained by sensing to the human image detection unit 13.

さらに、制御部１０は、駆動部４０を駆動して連動する頭部Ｒ２０をさらに角度９０°回転することにより、頭頂部Ｒ２０ａの撮像部５０および頭部眉間の人感センサー部６０を検出音の到来方向に向ける。撮像部５０は検出音の到来方向に位置するラジオ装置Ｑ（検出対象）を撮像する。撮像部５０は撮像して取得した撮像データを人像検出部１３に送信する。かつ、人感センサー部６０は検出音の到来方向に位置するラジオ装置Ｑ（検出対象）をセンシングする。人感センサー部６０はセンシングして取得したセンサー信号を人像検出部１３に送信する。 Further, the control unit 10 further rotates the head R20, which is interlocked by driving the drive unit 40, by an angle of 90 ° to detect the image pickup unit 50 of the crown R20a and the motion sensor unit 60 between the head eyebrows. Turn to the direction of arrival. The imaging unit 50 images the radio device Q (detection target) located in the direction of arrival of the detection sound. The imaging unit 50 transmits the imaged data acquired by imaging to the human image detection unit 13. In addition, the motion sensor unit 60 senses the radio device Q (detection target) located in the direction of arrival of the detected sound. The motion sensor unit 60 transmits the sensor signal obtained by sensing to the human image detection unit 13.

なお、上記説明では、撮像部５０および人感センサー部６０を動作して撮像データおよびセンサー信号の両方を取得したが、撮像部５０の撮像データ或いは人感センサー部６０のセンサー信号の何れかを取得して人像検出部１３に送信するようにしてもよい。 In the above description, both the imaging data and the sensor signal are acquired by operating the imaging unit 50 and the motion sensor unit 60, but either the imaging data of the imaging unit 50 or the sensor signal of the motion sensor unit 60 is used. It may be acquired and transmitted to the human image detection unit 13.

なお、図５（ｃ）に示すように頭部の正面はテレビ装置Ｔに対向しているが、撮像部５０の撮像したのち或いは人感センサー部６０のセンシングしたのちは駆動部４０により頭部を回転して、元の位置に戻る（図５（ｄ））。 As shown in FIG. 5(c), the front of the head faces the television device T; after the imaging unit 50 has captured its image or the motion sensor unit 60 has completed sensing, the drive unit 40 rotates the head back to its original position (FIG. 5(d)).

人像検出部１３は、撮像部５０からの各画像データおよび人感センサー部６０からの各センシング信号を取得すると、画像データおよびセンシング信号に基づいて各画像データおよび各センシング信号に人像が存在するか否かを検出する。この例において、人像検出部１３は、テレビ画像を含む画像データおよびラジオ装置Ｑを含む画像データから、いずれも人像未検出であると出力する。かつ、人像検出部１３では、テレビ画像を含むセンシング信号およびラジオ装置Ｑを含むセンシング信号から人像未検出であると出力する。人像検出部１３は、人像未検出の情報を検出音取得有効／無効判断部１４に送信する。 When the human image detection unit 13 acquires each image data from the imaging unit 50 and each sensing signal from the motion sensor unit 60, it detects whether a human image is present in each image data and each sensing signal. In this example, the human image detection unit 13 outputs that no human image was detected for both the image data including the television image and the image data including the radio device Q. Likewise, the human image detection unit 13 outputs that no human image was detected for the sensing signal including the television image and the sensing signal including the radio device Q. The human image detection unit 13 transmits the information that no human image was detected to the detection sound acquisition valid / invalid determination unit 14.

検出取得有効／無効判断部１４は、人像検出部１３からの人像未検出の情報に基づいて、両検出音（テレビ画面の音およびラジオ番組の音）の取得を無効とする判断を行う。制御部１０の検出音取得部１１は、両検出音の取得を中止する。図５（ｃ）中の点線で示す領域は音声無効化領域である。 Based on the information from the human image detection unit 13 that no human image was detected, the detection sound acquisition valid / invalid determination unit 14 determines that acquisition of both detected sounds (the sound from the television screen and the sound of the radio program) is invalid. The detection sound acquisition unit 11 of the control unit 10 stops acquiring both detected sounds. The area shown by the dotted line in FIG. 5(c) is the voice invalidation area.

ここで、図５（ｄ）に示すように、対話ロボットＲの各マイク２０が前方から人の発話の音（図中に示す一点鎖線）を検出すると、制御部１０の検出音取得部１１は、人の発話由来の検出音を取得する。検出音方向特定部１２は、検出音の到来方向（頭部前方）を特定する。 Here, as shown in FIG. 5(d), when the microphones 20 of the dialogue robot R detect the sound of a person's utterance from the front (dash-dot line in the figure), the detection sound acquisition unit 11 of the control unit 10 acquires the detected sound derived from the person's utterance. The detection sound direction specifying unit 12 specifies the arrival direction of the detected sound (in front of the head).

制御部１０は、検出音方向特定部１２からの検出音の到来方向の情報に基づいて、撮像部５０および人感センサー部６０を検出音の到来方向に向ける。撮像部５０は検出音の到来方向に位置する人Ｐを撮像し、撮像データを人像検出部に送信する。かつ、人感センサー部６０は検出音の到来方向に位置する人Ｐをセンシングし、センサー信号を人像検出部１３に送信する。 The control unit 10 directs the imaging unit 50 and the motion sensor unit 60 in the direction of arrival of the detected sound based on the information of the direction of arrival of the detected sound from the detection sound direction specifying unit 12. The image pickup unit 50 takes an image of the person P located in the direction of arrival of the detection sound, and transmits the image pickup data to the person image detection unit. In addition, the motion sensor unit 60 senses the person P located in the direction of arrival of the detection sound, and transmits the sensor signal to the human image detection unit 13.

人像検出部１３は、画像データから人像の存在を検出し、人検出の情報を検出音取得有効／無効判断部１４に送信する。かつ、人像検出部１３は、センシング信号から人像の存在を検出し、人検出の情報を検出音取得有効／無効判断部１４に送信する。 The human image detection unit 13 detects the existence of a human image from the image data, and transmits the human detection information to the detection sound acquisition valid / invalid determination unit 14. In addition, the human image detection unit 13 detects the presence of a human image from the sensing signal, and transmits the human detection information to the detection sound acquisition valid / invalid determination unit 14.

検出音取得有効／無効判断部１４は、人検出の情報に基づいて、検出音の到来方向の検出音（人の音声）の取得を有効とする判断を行う。制御部１０の検出音取得部１１は検出音の取得を継続する。 The detection sound acquisition enable / disable determination unit 14 determines that the acquisition of the detection sound (human voice) in the direction of arrival of the detection sound is valid based on the information of the person detection. The detection sound acquisition unit 11 of the control unit 10 continues to acquire the detection sound.

対話ロボットの処理の流れについて、図６を用いて説明する。図６は、対話ロボットの処理の流れの一例を示すフローチャートである。 The processing flow of the interactive robot will be described with reference to FIG. FIG. 6 is a flowchart showing an example of the processing flow of the interactive robot.

複数のマイク２０，２０，２０が音を検出するまで待機状態となる（Ｓ１１）。音が到来する（Ｓ１２）と、複数のマイク２０，２０，２０が音を検出し、検出音取得部１１が検出音をそれぞれ区別して取得し、検出音方向特定部１２が、検出音から音源の音の到来方向を特定し、検出音の到来方向情報を取得する場合（Ｓ１３、ＹＥＳ）、かつ、検出音の数が１つである場合（Ｓ１４、ＹＥＳ）、検出音方向特定部１２からの検出音の到来方向情報に基づいて、駆動部４０を動作し、頭部Ｒ２０の頭頂部Ｒ２０ａの撮像部５０および頭部眉間の人感センサー部６０を音の到来方向に向ける。撮像部５０は検出音の到来方向を撮像する。かつ、人感センサー部６０は音の到来方向をセンシングする（Ｓ１５）。画像データおよびセンシング信号は人像検出部１３に送信される。 It goes into a standby state until a plurality of microphones 20, 20, 20 detect sound (S11). When a sound arrives (S12), a plurality of microphones 20, 20, and 20 detect the sound, the detected sound acquisition unit 11 separately acquires the detected sound, and the detected sound direction specifying unit 12 detects the sound from the detected sound. When the arrival direction of the sound is specified and the arrival direction information of the detected sound is acquired (S13, YES), and when the number of detected sounds is one (S14, YES), the detected sound direction specifying unit 12 The drive unit 40 is operated based on the information on the arrival direction of the detected sound, and the imaging unit 50 of the crown R20a of the head R20 and the human sensor unit 60 between the head eyebrows are directed in the direction of arrival of the sound. The imaging unit 50 images the direction of arrival of the detected sound. At the same time, the motion sensor unit 60 senses the arrival direction of the sound (S15). The image data and the sensing signal are transmitted to the human image detection unit 13.

人像検出部１３は、画像データおよびセンシング信号から人像の存在を検出する場合（Ｓ１６、ＹＥＳ）、検出音取得有効／無効判断部１４は、人像検出の情報に基づいて、検出音の到来方向の検出音（人の音声）の取得を有効とする判断を行い、制御部１０（検出音取得部１１）は人由来の検出音とする音声データを取得（Ｓ１７）し、音到来の処理に戻る（Ｓ１２）。 When the human image detection unit 13 detects the presence of a human image from the image data and the sensing signal (S16, YES), the detection sound acquisition valid / invalid determination unit 14 determines the direction of arrival of the detected sound based on the human image detection information. The control unit 10 (detection sound acquisition unit 11) acquires voice data to be detected sound derived from a person (S17) after determining that the acquisition of the detected sound (human voice) is valid, and returns to the sound arrival process. (S12).

ここで、検出音が１つでない場合（Ｓ１４、ＮＯ）、検出音方向特定部１２からの各検出音の到来方向情報に基づいて、駆動部４０を動作し、頭部Ｒ２０の頭頂部Ｒ２０ａの撮像部５０および頭部眉間の人感センサー部６０を検出音の到来方向にそれぞれ向ける。撮像部５０は一方の検出音の到来方向のテレビ装置Ｔのテレビ画像（検出対象）および他方の検出音の到来方向のラジオ装置Ｑ（検出対象）をそれぞれ撮像する。かつ、人感センサー部６０は一方の検出音の到来方向のテレビ装置Ｔのテレビ画像（検出対象）および他方の検出音の到来方向のラジオ装置Ｑ（検出対象）をセンシングする。これらテレビ画像を含む画像データ、ラジオ装置Ｑを含む画像データ、テレビ画像を含むセンシング信号及びラジオ装置Ｑを含むセンシング信号は、人像検出部１３に送信される（Ｓ２１）。 Here, when there is not one detected sound (S14, NO), the driving unit 40 is operated based on the arrival direction information of each detected sound from the detected sound direction specifying unit 12, and the head portion R20a of the head R20 is operated. The imaging unit 50 and the human sensor unit 60 between the head and eyebrows are directed in the direction of arrival of the detection sound. The imaging unit 50 captures a television image (detection target) of the television device T in the direction of arrival of one detection sound and a radio device Q (detection target) in the direction of arrival of the other detection sound. In addition, the motion sensor unit 60 senses the television image (detection target) of the television device T in the direction of arrival of one detection sound and the radio device Q (detection target) in the direction of arrival of the other detection sound. The image data including the television image, the image data including the radio device Q, the sensing signal including the television image, and the sensing signal including the radio device Q are transmitted to the human image detection unit 13 (S21).

人像検出部１３は、第１の音到来方向の画像データおよびセンシング信号から人像の存在を検出する場合（Ｓ２２、ＹＥＳ）、かつ、第２の音到来方向の画像データおよびセンシング信号から人像の存在を検出する場合（Ｓ２３、ＹＥＳ）、検出音取得有効／無効判断部１４は、人像検出部１３からの人像検出の情報に基づいて、第１および第２検出音の到来方向の両検出音の取得を無効とする判断を行い、制御部１０（検出音取得部１１）は音声データを取得せず（Ｓ２４）、再度音声を取得するために音到来の処理に戻る（Ｓ１２）。 When the human image detection unit 13 detects the presence of a human image from the image data and the sensing signal in the first sound arrival direction (S22, YES), and the presence of the human image from the image data and the sensing signal in the second sound arrival direction. (S23, YES), the detection sound acquisition valid / invalid determination unit 14 determines both of the detection sounds in the arrival directions of the first and second detection sounds based on the information of the human image detection from the human image detection unit 13. After determining to invalidate the acquisition, the control unit 10 (detection sound acquisition unit 11) does not acquire the sound data (S24), and returns to the sound arrival process in order to acquire the sound again (S12).

または、人像検出部１３は、第１の検出音到来方向のテレビ画像を含む画像データおよび第１の検出音到来方向のテレビ画像を含むセンシング信号から人像の存在を検出できない場合（Ｓ２２、ＮＯ）、第２の検出音到来方向のラジオ装置Ｑを含む画像データおよびラジオ装置Ｑを含むセンシング信号から人像の存在を検出できない場合（Ｓ２５、ＮＯ）、検出音取得有効／無効判断部１４は、人像未検出の情報に基づいて、第１および第２検出音の到来方向軸Ｌ，Ｌで囲まれた領域における両検出音（テレビ画像の音、ラジオ装置の音）の取得を無効とする判断を行い（Ｓ２７、図５（ｃ））、制御部１０の検出音取得部１１は音声データを取得せず（Ｓ１７、図５（ｄ））、再度音声を取得するために音到来の処理に戻る（Ｓ１２）。例えば、第１および第２検出音の到来方向軸Ｌ，Ｌで囲まれた領域とは、第１検出音（第1の音源）の方向を基準０°±１０°で検知し、第２検出音（第２の音源）の方向を１００°±１０°で検知できた場合に、両検知した範囲の最大値と最小値から得られる。この場合、到来方向軸Ｌ，Ｌで囲まれた領域は、−１０°〜１１０°となる。 Alternatively, when the human image detection unit 13 cannot detect the presence of a human image from the image data and the sensing signal including the television image in the first detected sound arrival direction (S22, NO), and cannot detect the presence of a human image from the image data and the sensing signal including the radio device Q in the second detected sound arrival direction (S25, NO), the detection sound acquisition valid / invalid determination unit 14 determines, based on the information that no human image was detected, that acquisition of both detected sounds (the sound of the television image and the sound of the radio device) in the region enclosed by the arrival direction axes L, L of the first and second detected sounds is invalid (S27, FIG. 5(c)). The detection sound acquisition unit 11 of the control unit 10 does not acquire voice data (S17, FIG. 5(d)) and returns to the sound arrival process (S12) in order to acquire voice again. For example, if the direction of the first detected sound (first sound source) is detected at 0° ± 10°, taken as the reference, and the direction of the second detected sound (second sound source) is detected at 100° ± 10°, the region enclosed by the arrival direction axes L, L is obtained from the maximum and minimum values of the two detected ranges. In this case, the region enclosed by the arrival direction axes L, L is −10° to 110°.
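
The region calculation in this example can be sketched as follows. This is a minimal illustration that assumes each detection is given as a center angle and a tolerance in degrees and that the region does not wrap around ±180°; the function names are hypothetical and do not appear in the disclosure.

```python
def invalidated_region(detections):
    """Return (min_deg, max_deg) spanning all detected direction ranges.

    `detections` is a list of (center_deg, tolerance_deg) pairs, one per
    detected sound source. Wrap-around at 180 degrees is not handled.
    """
    lows = [center - tol for center, tol in detections]
    highs = [center + tol for center, tol in detections]
    return min(lows), max(highs)


def in_region(angle_deg, region):
    """Return True if a newly detected direction falls inside the region."""
    low, high = region
    return low <= angle_deg <= high


# First source at 0 deg +/- 10 deg, second source at 100 deg +/- 10 deg.
region = invalidated_region([(0, 10), (100, 10)])
print(region)  # (-10, 110)
```

A sound arriving from a direction for which `in_region` returns `False` would then be a candidate for valid acquisition, in line with the description of acquiring a person's voice from outside the invalidated area.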

第２の検出音到来方向の画像データおよびセンシング信号から人像の存在を検出する場合（Ｓ２５、ＹＥＳ）、検出音取得有効／無効判断部１４は、人像検出の情報に基づいて、第２の検出音到来方向における検出音の取得を有効とする判断を行い、制御部１０（検出音取得部１１）は音声データを取得する（Ｓ１７）。 When detecting the presence of a human image from the image data in the arrival direction of the second detection sound and the sensing signal (S25, YES), the detection sound acquisition valid / invalid determination unit 14 performs the second detection based on the information of the human image detection. The control unit 10 (detection sound acquisition unit 11) acquires voice data after determining that the acquisition of the detected sound in the sound arrival direction is effective (S17).

人像検出部１３は、画像データおよびセンシング信号から人像の存在を検出できない場合（Ｓ１６、ＮＯ）、人像検出部１３は、第２の検出音到来方向における画像データおよびセンシング信号から人像の存在を検出できない場合（Ｓ２３、ＮＯ）の場合、検出音取得有効／無効判断部１４は、人像検出部１３からの人像未検出の情報に基づいて、検出音の到来方向の検出音の取得を無効とする判断を行い、制御部１０（検出音取得部１１）は音声データを取得せず（Ｓ２６）、音到来の処理に戻る（Ｓ１２）。 When the human image detection unit 13 cannot detect the presence of a human image from the image data and the sensing signal (S16, NO), or cannot detect the presence of a human image from the image data and the sensing signal in the second detected sound arrival direction (S23, NO), the detection sound acquisition valid / invalid determination unit 14 determines, based on the information from the human image detection unit 13 that no human image was detected, that acquisition of the detected sound from its arrival direction is invalid. The control unit 10 (detection sound acquisition unit 11) does not acquire voice data (S26) and returns to the sound arrival process (S12).
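
The branches of the flowchart described above (S14 through S27) can be summarized as a small decision function. This is only one reading of the text, not the applicant's implementation: the function name, the boolean inputs, and the returned tuple are assumptions, and the step labels simply follow the order in which the description mentions them.

```python
def decide_acquisition(person_detected):
    """Map human image detection results to a flowchart outcome.

    `person_detected` holds one boolean per detected sound direction
    (one or two entries). Returns (step_label, valid_direction_indexes).
    """
    if len(person_detected) == 1:          # S14: YES, a single detected sound
        if person_detected[0]:             # S16: YES
            return "S17", [0]              # acquisition is valid
        return "S26", []                   # S16: NO, acquisition is invalid

    if len(person_detected) != 2:
        raise ValueError("this sketch handles one or two detected sounds")

    first, second = person_detected        # S14: NO, two detected sounds
    if first:                              # S22: YES
        if second:                         # S23: YES, people in both directions
            return "S24", []               # both acquisitions are invalid
        return "S26", []                   # S23: NO
    if second:                             # S22: NO, S25: YES
        return "S17", [1]                  # second direction is valid
    return "S27", []                       # S25: NO, invalidate enclosed region
```

For example, `decide_acquisition([False, False])` corresponds to the television and radio case of FIG. 5, where neither direction contains a person and the region between the two arrival axes is invalidated.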

以上の処理によれば、対話ロボットＲは、複数の音源からの検出音でかつ、明らかに人ではない検出音の取得を、複数の音源からの検出音の到来軸で形成されるエリアは無効とすることができる。 According to the above processing, for detected sounds that come from a plurality of sound sources and are clearly not from a person, the dialogue robot R can invalidate acquisition within the area formed by the arrival axes of the detected sounds from those sound sources.

すなわち、対話ロボットＲは、複数の検出音を検出し、検出音方向特定部１２が各検出音の到来方向を特定して、検出音方向特定部１２からの各検出音の到来方向情報に基づいて、駆動部４０を動作し、頭部Ｒ２０の頭頂部Ｒ２０ａの撮像部５０および頭部眉間の人感センサー部６０を検出音の到来方向にそれぞれ向けることができる。撮像部５０は一方の検出音の到来方向のテレビ装置Ｔのテレビ画像（検出対象）および他方の検出音の到来方向のラジオ装置Ｑ（検出対象）をそれぞれ撮像することができる。かつ、人感センサー部６０は一方の検出音の到来方向のテレビ装置Ｔのテレビ画像（検出対象）および他方の検出音の到来方向のラジオ装置Ｑ（検出対象）をセンシングすることができる。これらテレビ画像を含む画像データ、ラジオ装置Ｑを含む画像データ、テレビ画像を含むセンシング信号及びラジオ装置Ｑを含むセンシング信号は、人像検出部１３に送信される。人像検出部１３は、第１および第２の検出音到来方向のテレビ画像を含む２つの画像データおよび２つのセンシング信号から人像の存在を検出できない場合に、検出音取得有効／無効判断部１４は、人像未検出の情報に基づいて、第１および第２検出音の到来方向軸Ｌ，Ｌで囲まれた領域における両検出音（テレビ画像の音、ラジオ装置の音）の取得を無効と判断し、その後、無効化したエリア以外から検出した人の音声を有効に取得することができるので、人が発話した音声の認識精度を向上することができる。 That is, the dialogue robot R detects a plurality of detected sounds, the detected sound direction specifying unit 12 specifies the arrival direction of each detected sound, and is based on the arrival direction information of each detected sound from the detected sound direction specifying unit 12. Therefore, the drive unit 40 can be operated to direct the image pickup unit 50 of the crown R20a of the head R20 and the human sensor unit 60 between the head eyebrows in the direction of arrival of the detection sound. The imaging unit 50 can image a television image (detection target) of the television device T in the direction of arrival of one detection sound and a radio device Q (detection target) in the direction of arrival of the other detection sound. In addition, the motion sensor unit 60 can sense the television image (detection target) of the television device T in the arrival direction of one detection sound and the radio device Q (detection target) in the arrival direction of the other detection sound. The image data including the television image, the image data including the radio device Q, the sensing signal including the television image, and the sensing signal including the radio device Q are transmitted to the human image detection unit 13. 
When the human image detection unit 13 cannot detect the presence of a human image from the two image data and the two sensing signals for the first and second detected sound arrival directions, including those of the television image, the detection sound acquisition valid / invalid determination unit 14 determines, based on the information that no human image was detected, that acquisition of both detected sounds (the sound of the television image and the sound of the radio device) in the region enclosed by the arrival direction axes L, L of the first and second detected sounds is invalid. After that, the voice of a person detected from outside the invalidated area can be validly acquired, so the recognition accuracy of voice uttered by a person can be improved.

また、対話ロボットＲは、複数の人の発話による複数の検出音が検出した場合には、検出音方向特定部１２が各検出音の到来方向を特定して、検出音方向特定部１２からの各検出音の到来方向情報に基づいて、駆動部４０を動作し、頭部Ｒ２０の頭頂部Ｒ２０ａの撮像部および頭部眉間の人感センサー部６０を検出音の到来方向にそれぞれ向けることができる。撮像部５０は２つ検出音の到来方向をそれぞれ撮像することができる。かつ、人感センサー部６０は２つの検出音の到来方向をセンシングすることができる。これら画像データ、センシング信号は、人像検出部１３に送信される。人像検出部１３は、第１の音到来方向の画像データおよびセンシング信号から人像の存在を検出する場合、かつ、第２の音到来方向の画像データおよびセンシング信号から人像の存在を検出する場合、検出音取得有効／無効判断部１４は、人像検出部１３からの人像検出の情報に基づいて、第１および第２検出音の到来方向の両検出音の取得を無効とする判断を行い、制御部１０（検出音取得部１１）は複数の人の音声データを取得しないこととなる。 Further, when the dialogue robot R detects a plurality of detected sounds resulting from utterances by a plurality of people, the detection sound direction specifying unit 12 specifies the arrival direction of each detected sound, and the drive unit 40 is operated based on the arrival direction information of each detected sound from the detection sound direction specifying unit 12, so that the imaging unit on the crown R20a of the head R20 and the motion sensor unit 60 between the eyebrows of the head can each be directed toward the arrival directions of the detected sounds. The imaging unit 50 can image the arrival direction of each of the two detected sounds. In addition, the motion sensor unit 60 can sense the arrival directions of the two detected sounds. These image data and sensing signals are transmitted to the human image detection unit 13. When the human image detection unit 13 detects the presence of a human image from the image data and the sensing signal in the first sound arrival direction and also detects the presence of a human image from the image data and the sensing signal in the second sound arrival direction, the detection sound acquisition valid / invalid determination unit 14 determines, based on the human image detection information from the human image detection unit 13, that acquisition of both detected sounds from the arrival directions of the first and second detected sounds is invalid, and the control unit 10 (detection sound acquisition unit 11) therefore does not acquire the voice data of the plurality of people.

［実施形態３］ [Embodiment 3]
本開示の実施形態３について、図７〜９を用いて説明する。なお、説明の便宜上、実施形態１で説明した部材と同じ機能を有する部材については、同じ符号を付記し、その説明を省略する。図７は、本実施形態に係る対話ロボットＲの要部構成を示すブロック図である。この対話ロボットのブロック図では音源検出部と駆動制御部の機能を追加した一方、人感センサー部を除く。 Embodiment 3 of the present disclosure will be described with reference to FIGS. 7 to 9. For convenience of explanation, the same reference numerals will be added to the members having the same functions as the members described in the first embodiment, and the description thereof will be omitted. FIG. 7 is a block diagram showing a main configuration of the dialogue robot R according to the present embodiment. In the block diagram of this dialogue robot, the functions of the sound source detection unit and the drive control unit are added, while the motion sensor unit is excluded.

検出音取得部１１は、マイク２０からの検出音を取得するものである。検出音取得部１１は、複数のマイク２０からそれぞれの検出音を区別して取得する。また、検出音取得部１１は、各マイク２０の検出音を任意の長さで区切って、複数回にわたり取得する。検出音取得部１１は、検出音方向特定部１２、人像検出部１３、検出音取得有効／無効判断部１４、音源検出部１５および駆動制御部１６を含む。 The detection sound acquisition unit 11 acquires the detection sound from the microphone 20. The detection sound acquisition unit 11 distinguishes and acquires each detection sound from the plurality of microphones 20. Further, the detection sound acquisition unit 11 divides the detection sound of each microphone 20 by an arbitrary length and acquires the detection sound a plurality of times. The detection sound acquisition unit 11 includes a detection sound direction specifying unit 12, a human image detection unit 13, a detection sound acquisition valid / invalid determination unit 14, a sound source detection unit 15, and a drive control unit 16.

人像検出部１３は、検出音方向特定部１２で特定した検出音の到来方向を、後述する撮像部５０で撮像して取得した撮像データに基づいて、人像が存在するか否かを検出する。 The human image detection unit 13 detects whether or not a human image exists based on the imaging data acquired by imaging the arrival direction of the detection sound specified by the detection sound direction specifying unit 12 by the imaging unit 50 described later.

検出音取得有効／無効判断部１４は、人像検出部１３から取得した情報に基づいて、検出音が人の発話である場合を検出音の取得を有効とする判断を行い、又は検出音が人の発話でない場合を検出音の所得を無効とする判断を行う。 Based on the information acquired from the human image detection unit 13, the detection sound acquisition valid / invalid determination unit 14 determines that acquisition of the detected sound is valid when the detected sound is a human utterance, or that acquisition of the detected sound is invalid when the detected sound is not a human utterance.

撮像部５０は、検出音方向特定部１２で特定した検出音の到来方向を撮像し、取得した撮像データを人像検出部１３および後述する音源検出部１５に送信する。 The imaging unit 50 images the arrival direction of the detected sound specified by the detection sound direction specifying unit 12, and transmits the acquired imaging data to the human image detecting unit 13 and the sound source detecting unit 15 described later.

音源検出部１５は、人像検出部１３からの人像の情報と撮像部５０からの撮像データとに基づいて、音源の存在を検出する。音源とは人の発話以外のテレビ装置Ｔのテレビ画像の音やラジオ装置Ｑの音等、その他機器の音等である。 The sound source detection unit 15 detects the presence of a sound source based on the information of the human image from the human image detection unit 13 and the image pickup data from the image pickup unit 50. The sound source is the sound of a television image of the television device T, the sound of the radio device Q, or the sound of other devices other than human utterances.

駆動制御部１６は、音源検出部１５からの情報に基づいて、ロボット本体Ｒ１０の向きを変えるため駆動部４０を動作制御する。 The drive control unit 16 controls the operation of the drive unit 40 in order to change the direction of the robot body R10 based on the information from the sound source detection unit 15.

駆動部４０は、撮像部５０を検出音方向特定部１２で特定した検出音の到来方向に向けるように駆動する。また、駆動部４０は駆動制御部１６からの制御情報に基づいて指定された駆動を行う。駆動部４０には、対話ロボットＲを移動する移動部４１を含む。 The drive unit 40 drives the image pickup unit 50 so as to direct the image pickup unit 50 in the arrival direction of the detection sound specified by the detection sound direction identification unit 12. Further, the drive unit 40 performs a designated drive based on the control information from the drive control unit 16. The drive unit 40 includes a moving unit 41 that moves the interactive robot R.

次に、対話ロボットＲの具体的な動作について、図８を用いて説明する。図８は、対話ロボットＲの動作例を示す図である。テレビ装置Ｔと人Ｐとが同じ方向に居る場合である。 Next, the specific operation of the dialogue robot R will be described with reference to FIG. FIG. 8 is a diagram showing an operation example of the interactive robot R. This is a case where the television device T and the person P are in the same direction.

対話ロボットＲは、ロボット本体Ｒ１０の前方からのテレビ装置Ｔのテレビ番組の音（検出音）を検知し、撮像部５０がテレビ装置Ｔの方向を撮像する。この撮像データには、テレビ装置Ｔのテレビ画面と人像が写り込んでおり、検出音がテレビ番組の音であるのか或いは人の音声であるのか判定することができない。 The dialogue robot R detects the sound (detection sound) of the TV program of the TV device T from the front of the robot main body R10, and the imaging unit 50 images the direction of the TV device T. The television screen of the television device T and a human image are reflected in the captured data, and it cannot be determined whether the detected sound is the sound of a television program or the voice of a person.

そこで、対話ロボットＲは、新たに設けた音源検出部１５からの音源検出の情報に基づいてロボット本体Ｒ１０の向きを変更し、スピーカ３０より発声して、人Ｐに移動を促した後に、新たに発話された人の音声（検出音）の取得を有効とする場合について説明する。 Therefore, the following describes a case in which the dialogue robot R changes the orientation of the robot body R10 based on the sound source detection information from the newly provided sound source detection unit 15, speaks through the speaker 30 to prompt the person P to move, and then validates acquisition of the newly uttered voice (detected sound) of the person.

図８（ａ）に示すように、対話ロボットＲの各マイク２０がテレビ番組の音（図中に示す一点鎖線）を検出すると、制御部１０の検出音取得部１１は、テレビ番組の検出音を取得する。検出音方向特定部１２は、テレビ番組の検出音から検出音の到来方向（頭部後方、図中に示す矢印方向）を特定する。この検出音の到来方向は、マイク２０で取得した音の音量差や位相差から特定する。 As shown in FIG. 8(a), when the microphones 20 of the dialogue robot R detect the sound of the TV program (dash-dot line in the figure), the detection sound acquisition unit 11 of the control unit 10 acquires the detected sound of the TV program. The detection sound direction specifying unit 12 specifies the arrival direction of the detected sound (behind the head; the arrow direction in the figure) from the detected sound of the TV program. The arrival direction of the detected sound is specified from the volume differences and phase differences of the sound acquired by the microphones 20.

制御部１０は、検出音方向特定部１２からの検出音の到来方向情報に基づいて、頭頂部Ｒ２０ａの撮像部５０を検出音の到来方向に向ける。撮像部５０は検出音の到来方向に位置するテレビ装置Ｔ（検出対象）とテレビ装置Ｔの後方の人Ｐ（検出対象）を撮像する（図８（ｂ））。撮像部５０は撮像して取得した撮像データを人像検出部１３および音源検出部１５に送信する。この撮像データにはテレビ装置Ｔと人像が含まれる。 The control unit 10 directs the imaging unit 50 of the crown R20a to the arrival direction of the detected sound based on the arrival direction information of the detected sound from the detection sound direction specifying unit 12. The imaging unit 50 images a television device T (detection target) located in the direction of arrival of the detection sound and a person P (detection target) behind the television device T (FIG. 8B). The imaging unit 50 transmits the imaged data acquired by imaging to the human image detection unit 13 and the sound source detection unit 15. This imaging data includes a television device T and a human image.

人像検出部１３は、撮像部５０からの画像データを取得すると、画像データに基づいて画像データに人像が存在するか否かを検出する。この例において、人像検出部１３は、テレビ装置Ｔと人像を含む画像データから、人像を検出する。人像検出部１３は、人像検出の情報を音声取得有効/無効判断部１４および音源検出部１５に送信する。 When the human image detection unit 13 acquires the image data from the imaging unit 50, the human image detection unit 13 detects whether or not a human image exists in the image data based on the image data. In this example, the human image detection unit 13 detects a human image from the television device T and image data including the human image. The human image detection unit 13 transmits the human image detection information to the voice acquisition valid / invalid determination unit 14 and the sound source detection unit 15.

音源検出部１５は、人像検出部１３からの人像検出の情報と撮像部５０からの画像データに基づいて、音源が存在するか否かを検出する。 The sound source detection unit 15 detects whether or not a sound source exists based on the information on the human image detection from the human image detection unit 13 and the image data from the image pickup unit 50.

音源検出部１５は、記憶部７０に記憶されている音源を示す音源画像データを読み出し、この音源画像データと撮像部５０からの画像データとを画像マッチングする。画像マッチングの結果、音源画像データと画像データとが一致する場合、音源が検出される。音源画像データと画像データとが一致しない場合には、音源は検出されない。音源検出部１５は、音源検出の情報を駆動制御部１６へ送信する。具体的には、音源検出部１５は、音源画像データと、テレビ装置Ｔと人像を含む撮像部５０からの画像データとを画像マッチングした結果、音源画像データと画像データとが一致するため、テレビ装置Ｔ（音源）が検出される。音源検出部１５は、検出した音源検出の情報を駆動制御部１６へ送信する。 The sound source detection unit 15 reads out sound source image data indicating a sound source stored in the storage unit 70 and image-matches this sound source image data against the image data from the imaging unit 50. If the image matching shows that the sound source image data and the image data match, a sound source is detected. If they do not match, no sound source is detected. The sound source detection unit 15 transmits the sound source detection information to the drive control unit 16. Specifically, the sound source detection unit 15 image-matches the sound source image data against the image data from the imaging unit 50, which contains the television device T and the human image; because the two match, the television device T (sound source) is detected. The sound source detection unit 15 transmits the resulting sound source detection information to the drive control unit 16.
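The description does not say which image matching method the sound source detection unit 15 uses. As one possible illustration, the sketch below slides a stored template over a grayscale image and reports a match when the mean absolute pixel difference of the best window falls within a threshold. The function name, the list-of-lists image representation and the threshold value are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale pixel values, row-major (assumed representation)


def find_template(
    image: Image, template: Image, max_mean_abs_diff: float = 10.0
) -> Optional[Tuple[int, int]]:
    """Return (row, col) of the best-matching window, or None if nothing matches.

    Hypothetical stand-in for matching stored sound source image data
    against the captured image data.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best_pos: Optional[Tuple[int, int]] = None
    best_score = float("inf")
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            # Sum of absolute differences between the window and the template.
            diff = sum(
                abs(image[r + i][c + j] - template[i][j])
                for i in range(th)
                for j in range(tw)
            )
            score = diff / (th * tw)
            if score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos if best_score <= max_mean_abs_diff else None
```

A return value other than `None` corresponds to "sound source detected"; `None` corresponds to "sound source not detected".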

駆動制御部１６は、音源検出部１５からの音源検出の情報に基づいて、駆動部４０（移動部４１）を動作して足部Ｒ５０（図２参照）を動かしロボット本体Ｒ１０の向きを指定方向の左９０度へ変更する（図８（ｃ））。制御部１０の出力制御部１９は、スピーカ３０より人Ｐに移動を促す発話を行う。対話ロボットＲからの移動指示にしたがって移動した人Ｐは、正面のロボット本体Ｒ１０に発話を行う。検出音方向特定部１２は、検出音から音源の音の到来方向（頭部前方、図中に示す矢印方向）を特定し、検出音の到来方向情報に基づいて、頭頂部Ｒ２０ａの撮像部５０を検出音の到来方向に向ける。撮像部５０は検出音の到来方向に位置する人Ｐ（検出対象）を撮像する。撮像部５０は撮像して取得した人像を含む撮像データを人像検出部１３に送信する。なお、図８（ｃ）では、移動部４１（例えば駆動モーター）を動作して足部Ｒ５０を動かしロボット本体Ｒ１０の向きを指定方向に変更したが、駆動部４０を駆動して頭部のみを指定方向に変更するようにしてもよい。 Based on the sound source detection information from the sound source detection unit 15, the drive control unit 16 operates the drive unit 40 (moving unit 41) to move the foot portion R50 (see FIG. 2) and turn the robot body R10 to the designated direction, 90 degrees to the left (FIG. 8 (c)). The output control unit 19 of the control unit 10 makes an utterance from the speaker 30 prompting the person P to move. The person P, having moved in accordance with the movement instruction from the dialogue robot R, speaks to the robot body R10 from in front of it. The detection sound direction specifying unit 12 specifies the arrival direction of the sound from the sound source (front of the head, the arrow direction in the figure) from the detected sound, and the imaging unit 50 of the crown R20a is directed toward that arrival direction based on the arrival direction information. The imaging unit 50 images the person P (detection target) located in the arrival direction of the detected sound and transmits the resulting imaging data, which includes the human image, to the human image detection unit 13. In FIG. 8 (c), the moving unit 41 (for example, a drive motor) is operated to move the foot portion R50 and turn the robot body R10 to the designated direction; alternatively, the drive unit 40 may be driven to turn only the head to the designated direction.

人像検出部１３は、検出音方向特定部１２の検出音の到来方向情報と撮像部５０の撮像データに基づいて、人像を検出すると、音声取得有効/無効判断部１４に人像検出の情報を送信する。 When the human image detection unit 13 detects a human image based on the arrival direction information from the detection sound direction specifying unit 12 and the imaging data from the imaging unit 50, it transmits the human image detection information to the voice acquisition valid / invalid determination unit 14.

音声取得有効/無効判断部１４は、人像検出部１３からの人像検出の情報に基づいて、人の発話による検出音の到来方向における人の音声の取得を有効と判断する。検出音取得部１１は音声データの取得を継続する（図８（ｄ））。 Based on the human image detection information from the human image detection unit 13, the voice acquisition valid / invalid determination unit 14 determines that acquisition of the human voice in the arrival direction of the sound uttered by the person is valid. The detection sound acquisition unit 11 continues to acquire voice data (FIG. 8 (d)).

ここでは音源としてのテレビ装置Ｔに関する処理について詳説する。 Here, the processing related to the television device T as a sound source will be described in detail.

検出音方向特定部１２は、テレビ装置Ｔから検出音の到来方向を特定し、検出音の到来方向情報に基づいて、必要に応じて駆動部４０を駆動して、頭頂部Ｒ２０ａの撮像部５０を検出音の到来方向に向ける。撮像部５０は検出音の到来方向に位置するテレビ装置（検出対象）を撮像する（図８（ｂ））。撮像部５０は撮像して取得したテレビ装置Ｔを含む撮像データを人像検出部１３に送信する。 The detection sound direction specifying unit 12 identifies the arrival direction of the detected sound from the television device T, drives the drive unit 40 as necessary based on the arrival direction information of the detected sound, and drives the image pickup unit 50 of the crown R20a. To the direction of arrival of the detection sound. The imaging unit 50 images a television device (detection target) located in the direction of arrival of the detected sound (FIG. 8 (b)). The image pickup unit 50 transmits the image pickup data including the television device T that has been imaged and acquired to the human image detection unit 13.

人像検出部１３は、検出音方向特定部１２の検出音の到来方向情報と撮像部５０の撮像データに基づいて、人像を検出できず、この人像未検出の情報を音声取得有効/無効判断部１４に送信する。 The human image detection unit 13 cannot detect a human image based on the arrival direction information from the detection sound direction specifying unit 12 and the imaging data from the imaging unit 50, and transmits this information that no human image was detected to the voice acquisition valid / invalid determination unit 14.

音声取得有効/無効判断部１４は、人像検出部１３からの人像未検出の情報に基づいて、テレビ装置Ｔによる検出音の到来方向における検出音の取得を無効と判断する。制御部１０の検出音取得部１１は、検出音の取得を中止する。図中の点線で示す領域は音声無効化領域である（図８（ｄ））。 Based on the information from the human image detection unit 13 that no human image was detected, the voice acquisition valid / invalid determination unit 14 determines that acquisition of the detected sound in the arrival direction of the sound from the television device T is invalid. The detection sound acquisition unit 11 of the control unit 10 stops acquiring the detected sound. The area indicated by the dotted line in the figure is the voice invalidation area (FIG. 8 (d)).

対話ロボットの処理の流れについて、図９を用いて説明する。図９は、対話ロボットの処理の流れの一例を示すフローチャートである。 The processing flow of the interactive robot will be described with reference to FIG. FIG. 9 is a flowchart showing an example of the processing flow of the interactive robot.

複数のマイク２０，２０，２０が音を検出するまで待機状態となる（Ｓ３１）。音が到来する（Ｓ３２）と、複数のマイク２０，２０，２０が音を検出し、検出音取得部１１が検出音をそれぞれ区別して取得し、検出音方向特定部１２が、検出音から検出音の到来方向を特定し、検出音の到来方向情報を取得する場合（Ｓ３３、ＹＥＳ）、検出音方向特定部１２からの検出音の到来方向情報に基づいて、駆動部を動作し、頭部Ｒ２０の頭頂部Ｒ２０ａの撮像部５０を検出音の到来方向に向ける。撮像部５０は検出音の到来方向を撮像する（Ｓ３４）。画像データは人像検出部１３及び音源検出部１５に送信される。 The robot stands by until the plurality of microphones 20, 20, 20 detect a sound (S31). When a sound arrives (S32), the microphones 20, 20, 20 detect the sound, the detection sound acquisition unit 11 acquires the detected sounds separately, and the detection sound direction specifying unit 12 specifies the arrival direction of the detected sound. If the arrival direction information is acquired (S33, YES), the drive unit is operated based on the arrival direction information from the detection sound direction specifying unit 12 to direct the imaging unit 50 of the crown R20a of the head R20 toward the arrival direction of the detected sound. The imaging unit 50 images the arrival direction of the detected sound (S34). The image data is transmitted to the human image detection unit 13 and the sound source detection unit 15.

人像検出部１３は、画像データに基づいて、人像の存在を検出できる場合（Ｓ３５、ＹＥＳ）、音源検出部１５は、人像検出部１３からの人像検出の情報と撮像部５０からの画像データに基づいて、記憶部７０に記憶されている音源を示す音源画像データを読み出し、この音源画像データと撮像部５０の画像データとを画像マッチングし、画像マッチングの結果、音源画像データと画像データとが一致する場合（Ｓ３６、ＹＥＳ）、駆動制御部１６は、音源検出部１５からの音源検出の情報に基づいて、駆動部４０を動作して、足部Ｒ５０を動かしロボット本体Ｒ１０の向きを指定方向の左９０度へ変更する制御を行う（Ｓ３７）。 When the human image detection unit 13 can detect the presence of a human image based on the image data (S35, YES), the sound source detection unit 15, based on the human image detection information from the human image detection unit 13 and the image data from the imaging unit 50, reads out the sound source image data indicating a sound source stored in the storage unit 70 and image-matches it against the image data from the imaging unit 50. If the sound source image data and the image data match (S36, YES), the drive control unit 16, based on the sound source detection information from the sound source detection unit 15, operates the drive unit 40 to move the foot portion R50 and performs control to turn the robot body R10 to the designated direction, 90 degrees to the left (S37).

制御部１０の出力制御部１９は、スピーカ３０より人Ｐに移動を促す発話を行うように制御する。人Ｐはロボット本体Ｒ１０の正面に移動して発話を行う。検出音方向特定部１２は、人発話の検出音から検出音の到来方向を特定し、検出音の到来方向情報に基づいて、駆動部４０を駆動して、頭頂部Ｒ２０ａの撮像部５０を検出音の到来方向に向ける。撮像部５０は検出音の到来方向に位置する人（検出対象）を撮像する。撮像部５０は撮像して取得した人像を含む撮像データを人像検出部１３に送信する。 The output control unit 19 of the control unit 10 controls the speaker 30 to make an utterance prompting the person P to move. Person P moves to the front of the robot body R10 and speaks. The detection sound direction specifying unit 12 identifies the arrival direction of the detected sound from the detected sound of the human utterance, drives the driving unit 40 based on the arrival direction information of the detected sound, and detects the imaging unit 50 of the crown R20a. Turn to the direction of arrival of sound. The imaging unit 50 images a person (detection target) located in the direction of arrival of the detection sound. The imaging unit 50 transmits image data including a human image acquired by imaging to the human image detection unit 13.

人像検出部１３は、検出音方向特定部１２の検出音の到来方向情報と撮像部５０の撮像データに基づいて、人像を検出すると、人が移動したこととなる（Ｓ３８、ＹＥＳ）。音到来の処理（Ｓ３２）に戻り、Ｓ３３、Ｓ３４、Ｓ３５、ＹＥＳの処理を行い、音源検出部１５は、人像検出部１３からの人像検出の情報と撮像部５０からの画像データに基づいて、記憶部７０に記憶されている音源を示す音源画像データを読み出し、この音源画像データと撮像部５０の画像データとを画像マッチングし、画像マッチングの結果、音源画像データと画像データとが一致しない場合（Ｓ３６、ＮＯ）、音声取得有効/無効判断部１４は、人像検出部１３からの人像検出の情報に基づいて、検出音の到来方向における人の音声の取得を有効と判断する。検出音取得部１１は、人の発話の音声データの取得を継続する。 When the human image detection unit 13 detects a human image based on the arrival direction information from the detection sound direction specifying unit 12 and the imaging data from the imaging unit 50, the person is regarded as having moved (S38, YES). The process returns to sound arrival (S32), and S33, S34, and S35 (YES) are performed. The sound source detection unit 15, based on the human image detection information from the human image detection unit 13 and the image data from the imaging unit 50, reads out the sound source image data indicating a sound source stored in the storage unit 70 and image-matches it against the image data from the imaging unit 50. If the sound source image data and the image data do not match (S36, NO), the voice acquisition valid / invalid determination unit 14, based on the human image detection information from the human image detection unit 13, determines that acquisition of the human voice in the arrival direction of the detected sound is valid. The detection sound acquisition unit 11 continues to acquire the voice data of the person's utterance.

上記Ｓ３５の処理において、人像検出部１３は、画像データに基づいて、人像の存在を検出できない場合（Ｓ３５、ＮＯ）、音声取得有効/無効判断部１４は、人像検出部１３からの人像未検出の情報に基づいて、検出音の到来方向における検出音の取得を無効と判断する（Ｓ４０）。 In the process of S35, when the human image detection unit 13 cannot detect the presence of a human image based on the image data (S35, NO), the voice acquisition valid / invalid determination unit 14, based on the information from the human image detection unit 13 that no human image was detected, determines that acquisition of the detected sound in its arrival direction is invalid (S40).

上記Ｓ３８の処理において、人の移動を確認できない場合（Ｓ３８、ＮＯ）、Ｓ３８の処理に戻る。 If the movement of a person cannot be confirmed in the process of S38 (S38, NO), the process returns to the process of S38.
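The branching described for FIG. 9 (S35, S36, S37, S40) can be summarized as a small decision function. This is only a sketch of the branch structure as described in the text; the enum, the function name and the boolean inputs are hypothetical, and the waiting loop at S38 is not modeled.

```python
from enum import Enum


class Action(Enum):
    INVALIDATE = "invalidate acquisition of the detected sound (S40)"
    TURN_AND_PROMPT = "turn the robot body and prompt the person to move (S37)"
    ACQUIRE = "treat acquisition of the human voice as valid"


def decide_fig9(person_detected: bool, sound_source_matched: bool) -> Action:
    """Map the S35 and S36 results to the next action (illustrative only)."""
    if not person_detected:
        # S35, NO: no human image in the arrival direction.
        return Action.INVALIDATE
    if sound_source_matched:
        # S36, YES: a person and another sound source share the direction.
        return Action.TURN_AND_PROMPT
    # S36, NO: only the person is present in the arrival direction.
    return Action.ACQUIRE
```

After `TURN_AND_PROMPT`, the flow in the text waits at S38 until the person is found in the new arrival direction and then re-enters the same decision from S32.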

以上の処理によれば、対話ロボットは、人（話者）と他の音源とが同じ撮像方向にいたとしても、スピーカにより人に別方向へ移動するように促し、人と他の音源とが別方向に位置してから人の音声取得を行うことにより、明らかに人の発話でない検出音の取得を無効とし、音声認識の障害となり得る外部ノイズを抑制し、音声認識の精度を高めることが可能となる。 According to the above processing, even if the person (speaker) and the other sound source are in the same imaging direction, the dialogue robot prompts the person to move in a different direction by the speaker, and the person and the other sound source communicate with each other. By acquiring human voice after being located in a different direction, it is possible to invalidate the acquisition of detected sounds that are clearly not spoken by humans, suppress external noise that can interfere with voice recognition, and improve the accuracy of voice recognition. It will be possible.

すなわち、対話ロボットＲは、検出音を検出し、検出音方向特定部１２が検出音の到来方向を特定して、検出音方向特定部１２からの検出音の到来方向情報に基づいて、頭頂部Ｒ２０ａの撮像部５０を検出音の到来方向に向けることができる。撮像部５０はテレビ装置Ｔのテレビ画像（検出対象）および人Ｐ（検出対象）を同時に撮像することができる。テレビ画像と人像を含む画像データを含む画像データは、人像検出部１３に送信される。人像検出部１３は、人像の存在を検出する場合に、人像検出の情報を音声取得有効/無効判断部１４および音源検出部１５に送信する。音源検出部１５は、人像検出部１３からの人像検出の情報と撮像部５０からの画像データに基づいて、音源の存在を検出すると、音源検出の情報を駆動制御部１６へ送信し、駆動制御部１６は、音源検出部１５からの音源検出の情報に基づいて、駆動部４０の足部Ｒ５０を動かしロボット本体Ｒ１０の向きを指定方向へ変更する制御を行うことができる。対話ロボットＲは、スピーカ３０より人Ｐに移動を促す発話を行い、移動指示にしたがって移動した人Ｐが新たに発話した音声を有効に取得する一方で。テレビ装置Ｔによる検出音の到来方向における検出音の取得を無効とすることが可能となる。 That is, the dialogue robot R detects a sound, the detection sound direction specifying unit 12 specifies its arrival direction, and the imaging unit 50 of the crown R20a can be directed toward that arrival direction based on the arrival direction information from the detection sound direction specifying unit 12. The imaging unit 50 can simultaneously image the television image (detection target) of the television device T and the person P (detection target). The image data, which includes the television image and the human image, is transmitted to the human image detection unit 13. When the human image detection unit 13 detects the presence of a human image, it transmits the human image detection information to the voice acquisition valid / invalid determination unit 14 and the sound source detection unit 15. When the sound source detection unit 15 detects the presence of a sound source based on the human image detection information from the human image detection unit 13 and the image data from the imaging unit 50, it transmits the sound source detection information to the drive control unit 16. Based on that information, the drive control unit 16 can perform control to move the foot portion R50 of the drive unit 40 and turn the robot body R10 to the designated direction. The dialogue robot R makes an utterance from the speaker 30 prompting the person P to move and validly acquires the voice newly uttered by the person P who has moved in accordance with the movement instruction, while invalidating acquisition of the detected sound in the arrival direction of the sound from the television device T.

［実施形態４］ [Embodiment 4]
本開示の実施形態４について、図１０〜１２を用いて説明する。なお、説明の便宜上、実施形態４、対話ロボットの構造は、実施形態１の対話ロボットの構造と同じであるため重複する説明を省略する。実施形態１で説明した部材と同じ機能を有する部材については、同じ符号を付記し、その説明を省略する。図１０は、本実施形態に係る対話ロボットＲの要部構成を示すブロック図である。実施形態４の対話ロボットのブロック図と上述した実施形態１の対話ロボットのブロック図の違いは、駆動制御部の機能を追加した一方、人感センサー部を除く点が異なるものである。 Embodiment 4 of the present disclosure will be described with reference to FIGS. 10 to 12. For convenience of explanation, because the structure of the dialogue robot in Embodiment 4 is the same as that of the dialogue robot in Embodiment 1, duplicate description is omitted. Members having the same functions as members described in Embodiment 1 are given the same reference numerals, and their description is omitted. FIG. 10 is a block diagram showing the main configuration of the dialogue robot R according to the present embodiment. The block diagram of the dialogue robot of Embodiment 4 differs from that of Embodiment 1 described above in that the function of a drive control unit is added while the motion sensor unit is omitted.

検出音取得部１１は、マイク２０からの検出音を取得するものである。検出音取得部１１は、複数のマイク２０からそれぞれの検出音を区別して取得する。また、検出音取得部１１は、各マイク２０の検出音を任意の長さで区切って、複数回にわたり取得する。検出音取得部１１は、検出音方向特定部１２、人像検出部１３、検出音取得有効／無効判断部１４、および駆動制御部１６を含む。 The detection sound acquisition unit 11 acquires the detection sound from the microphone 20. The detection sound acquisition unit 11 distinguishes and acquires each detection sound from the plurality of microphones 20. Further, the detection sound acquisition unit 11 divides the detection sound of each microphone 20 by an arbitrary length and acquires the detection sound a plurality of times. The detection sound acquisition unit 11 includes a detection sound direction specifying unit 12, a human image detection unit 13, a detection sound acquisition valid / invalid determination unit 14, and a drive control unit 16.

人像検出部１３は、検出音方向特定部１２で特定した検出音の到来方向を、後述する撮像部５０で撮像して取得した撮像データに基づいて、人像が存在するか否かを検出する。人像検出部１３は、検出した情報を検出音取得有効／無効判断部１４および検出音方向特定部１２に送信する。 The human image detection unit 13 detects whether or not a human image exists based on the imaging data acquired by imaging the arrival direction of the detection sound specified by the detection sound direction specifying unit 12 by the imaging unit 50 described later. The human image detection unit 13 transmits the detected information to the detection sound acquisition valid / invalid determination unit 14 and the detection sound direction identification unit 12.

撮像部５０は、検出音方向特定部１２で特定した検出音の到来方向を撮像し、取得した撮像データを人像検出部１３に送信する。 The imaging unit 50 images the arrival direction of the detected sound specified by the detection sound direction specifying unit 12, and transmits the acquired imaging data to the human image detecting unit 13.

駆動制御部１６は、検出音方向特定部１２からの情報に基づいて、駆動部４０を動作制御する。 The drive control unit 16 controls the operation of the drive unit 40 based on the information from the detection sound direction specifying unit 12.

駆動部４０は、撮像部５０を検出音方向特定部１２で特定した検出音の到来方向に向けるように駆動する。また、駆動部４０は駆動制御部１６からの制御情報に基づいて指定された駆動を行う。駆動部４０には、対話ロボットＲを移動する移動部４１を含む。指定された駆動とは、例えば移動部４１が駆動されることにより、側方に所定距離移動する（図１１（ｃ））。 The drive unit 40 drives the image pickup unit 50 so as to direct the image pickup unit 50 in the arrival direction of the detection sound specified by the detection sound direction identification unit 12. Further, the drive unit 40 performs a designated drive based on the control information from the drive control unit 16. The drive unit 40 includes a moving unit 41 that moves the interactive robot R. The designated drive means, for example, that the moving unit 41 is driven to move a predetermined distance to the side (FIG. 11 (c)).

次に、対話ロボットＲの具体的な動作について、図１１を用いて説明する。図１１は、対話ロボットＲの動作例を示す図である。対話ロボットＲの正面には、ラジオ装置Ｑがあり、同対話ロボットＲの背面にはテレビ装置Ｔがある場合である。対話ロボットＲは、ロボット本体Ｒ１０の前方からのラジオ装置Ｑの音（検出音）およびロボット本体Ｒ１０の後方からのテレビ番組の音（検出音）を検知し、撮像部５０が検出音の方向をそれぞれ撮像する。この各撮像データには、ラジオ装置Ｑ、テレビ画面が写り込んでいるのみで、人は存在しない。そこで、対話ロボットＲは、検出音の無効化範囲を狭域化するためにロボット本体Ｒ１０を移動して、無効化範囲の両検出音の取得を無効とし後に、新たに発話された人Ｐの音声（検出音）の取得を有効とする場合について説明する。 Next, a specific operation of the dialogue robot R will be described with reference to FIG. 11. FIG. 11 is a diagram showing an operation example of the dialogue robot R. In this example, a radio device Q is located in front of the dialogue robot R and a television device T is located behind it. The dialogue robot R detects the sound of the radio device Q (detected sound) from the front of the robot body R10 and the sound of a television program (detected sound) from the rear of the robot body R10, and the imaging unit 50 images the direction of each detected sound. The respective imaging data show only the radio device Q and the television screen; no person is present. The following therefore describes the case in which the dialogue robot R moves the robot body R10 in order to narrow the invalidation range for detected sounds, invalidates acquisition of both detected sounds within the invalidation range, and then treats acquisition of the voice (detected sound) newly uttered by the person P as valid.

図１１（ａ）に示すように、対話ロボットＲの各マイク２０がロボット本体Ｒ１０の正面のラジオ番組の音（図中に示す一点鎖線）およびロボット本体Ｒ１０の後方のテレビ番組の音（図中に示す一点鎖線）をそれぞれ検出すると、制御部１０の検出音取得部１１は、ラジオ番組の検出音およびテレビ番組の検出音を取得する。検出音方向特定部１２は、各検出音から各検出音の到来方向（頭部前方および頭部後方、図中に示す２つの矢印方向）を特定する。各検出音の到来方向は、マイク２０で取得した音の音量差や位相差から特定する（図１１（ｂ））。 As shown in FIG. 11 (a), when each microphone 20 of the dialogue robot R detects the sound of a radio program in front of the robot body R10 (dash-dot line in the figure) and the sound of a television program behind the robot body R10 (dash-dot line in the figure), the detection sound acquisition unit 11 of the control unit 10 acquires the detected sound of the radio program and the detected sound of the television program. The detection sound direction specifying unit 12 specifies the arrival direction of each detected sound (front of the head and rear of the head, the two arrow directions in the figure) from the respective detected sounds. The arrival direction of each detected sound is specified from the volume difference and phase difference of the sound acquired by the microphones 20 (FIG. 11 (b)).

制御部１０は、検出音方向特定部１２からの各検出音の到来方向の情報に基づいて、胴体部Ｒ３０内の駆動部４０を動作し、頭頂部Ｒ２０ａの撮像部５０を各検出音の到来方向に向ける。 The control unit 10 operates the drive unit 40 in the body portion R30 based on the information of the arrival direction of each detection sound from the detection sound direction specifying unit 12, and the imaging unit 50 of the crown portion R20a arrives at each detection sound. Turn in the direction.

すなわち、制御部１０は、駆動部４０を駆動して連動する頭部Ｒ２０を動作することにより、頭頂部Ｒ２０ａの撮像部５０を検出音の到来方向（頭部前方および頭部後方、図中に示す矢印方向）にそれぞれ向ける。撮像部５０は検出音の到来方向に位置するラジオ装置Ｑ（検出対象）およびテレビ装置Ｔのテレビ画面（検出対象）をそれぞれ撮像する。撮像部５０は撮像して取得した撮像データを人像検出部１３に送信する。 That is, the control unit 10 drives the drive unit 40 to operate the head R20 linked to it, thereby directing the imaging unit 50 of the crown R20a toward each arrival direction of the detected sounds (front of the head and rear of the head, the arrow directions in the figure). The imaging unit 50 images the radio device Q (detection target) and the television screen of the television device T (detection target), each located in an arrival direction of a detected sound. The imaging unit 50 transmits the resulting imaging data to the human image detection unit 13.

人像検出部１３は、検出音方向特定部１２からの各検出音の到来方向の情報と撮像部５０からの各画像データに基づいて各画像データに人像が存在するか否かを検出する。この例において、人像検出部１３は、ラジオ装置Ｑを含む画像データおよびテレビ装置Ｔのテレビ画像を含む画像データから、いずれも人像未検出であると出力する。人像検出部１３は、人像未検出の情報を検出音取得有効／無効判断部１４に送信する。 The human image detection unit 13 detects whether or not a human image exists in each image data based on the information on the arrival direction of each detected sound from the detection sound direction specifying unit 12 and each image data from the imaging unit 50. In this example, the human image detection unit 13 outputs that the human image has not been detected from the image data including the radio device Q and the image data including the television image of the television device T. The human image detection unit 13 transmits information that the human image has not been detected to the detection sound acquisition valid / invalid determination unit 14.

検出取得有効／無効判断部１４は、人像検出部１３からの人像未検出の情報に基づいて、両検出音（テレビ装置の音およびラジオ装置の音）の取得を無効とする判断を行う。 Based on the information from the human image detection unit 13 that no human image was detected, the detection sound acquisition valid / invalid determination unit 14 determines that acquisition of both detected sounds (the sound of the television device and the sound of the radio device) is invalid.

ここで、駆動制御部１６は、それぞれの検出音の到来方向が作る角度が特定の角度、例えば１５０°以上であるか否かを判断する。この例において、ロボット本体Ｒ１０の正面にはラジオ装置Ｑが配置されており、ラジオ装置Ｑの検出音の到来方向を角度０°（基準）とし、ロボット本体Ｒ１０の後方にはテレビ装置Ｔが配置されており、テレビ装置Ｔの検出音の到来方向の角度を１８０°としている。すなわち、ラジオ装置Ｑとテレビ装置Ｔとの検出音の到来方向が作る角度αが例えば、１５０°以上となる場合、駆動制御部１６からの制御情報に基づいてロボット本体Ｒ１０を移動する。ロボット本体Ｒ１０は、制御情報に基づいて移動部４１が駆動されることにより、足部Ｒ５０を動かし側方に所定距離移動する（図１１（ｃ））。 Here, the drive control unit 16 determines whether or not the angle created by the arrival direction of each detected sound is a specific angle, for example, 150 ° or more. In this example, the radio device Q is arranged in front of the robot body R10, the arrival direction of the detection sound of the radio device Q is set to an angle of 0 ° (reference), and the television device T is arranged behind the robot body R10. The angle of the arrival direction of the detected sound of the television device T is set to 180 °. That is, when the angle α created by the arrival directions of the detected sounds of the radio device Q and the television device T is, for example, 150 ° or more, the robot body R10 is moved based on the control information from the drive control unit 16. The robot body R10 moves the foot portion R50 by driving the moving portion 41 based on the control information and moves a predetermined distance to the side (FIG. 11 (c)).
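The angle test described above can be illustrated with a short sketch. It assumes that each arrival direction is available as a bearing in degrees, with the radio device Q at 0° as in the example; the function names and the bearing representation are assumptions made for illustration, and only the 150° example threshold comes from the text.

```python
def angle_between_deg(direction_a_deg: float, direction_b_deg: float) -> float:
    """Smallest angle (0 to 180 degrees) between two arrival directions."""
    diff = abs(direction_a_deg - direction_b_deg) % 360.0
    return 360.0 - diff if diff > 180.0 else diff


def needs_relocation(
    direction_a_deg: float, direction_b_deg: float, threshold_deg: float = 150.0
) -> bool:
    """True when the two arrival directions are at least threshold_deg apart.

    In the described example this is the condition under which the robot
    body moves sideways by a predetermined distance.
    """
    return angle_between_deg(direction_a_deg, direction_b_deg) >= threshold_deg
```

With the radio device at 0° and the television device at 180°, the angle is 180° and the robot would move; after moving, the two bearings are closer together and the angle is re-evaluated.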

この例において、ロボット本体Ｒ１０が移動したことにより、ラジオ装置Ｑとテレビ装置Ｔとの２つの検出音の到来方向軸Ｌ，Ｌが作る角度αが１５０°以下となり、検出音取得有効／無効判断部１４は、人像検出部１３からの人未検出の情報に基づいて、それぞれ検出音の到来方向の各検出音の取得を無効とする判断を行う（図１１（ｄ））。図中の各点線Ｌ，Ｌにおける範囲αは検出音を無効化した範囲を示す。 In this example, because the robot body R10 has moved, the angle α formed by the arrival direction axes L, L of the two detected sounds from the radio device Q and the television device T becomes 150° or less, and the detection sound acquisition valid / invalid determination unit 14, based on the information from the human image detection unit 13 that no person was detected, determines that acquisition of each detected sound in its arrival direction is invalid (FIG. 11 (d)). The range α between the dotted lines L, L in the figure indicates the range in which detected sounds are invalidated.

なお、本例において、２つの検出音の到来方向軸Ｌ，Ｌが作る角度を１５０°以下の場合に検出音取得有効／無効判断部１４は、人像検出部１３からの人未検出の情報に基づいて、それぞれ検出音の到来方向の各検出音の取得を無効とする判断を行ったが、この角度は任意に変更することができ、例えば角度１２０°としてもよい。このように角度を狭めることで、無効エリアを狭める一方で、有効エリアを広く設定することができる。 In this example, the detection sound acquisition valid / invalid determination unit 14 determines, based on the information from the human image detection unit 13 that no person was detected, that acquisition of each detected sound in its arrival direction is invalid when the angle formed by the arrival direction axes L, L of the two detected sounds is 150° or less. This angle can be changed as desired and may be, for example, 120°. Narrowing the angle in this way narrows the invalid area and allows a wider valid area to be set.
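Once an invalidated range bounded by the two arrival direction axes has been set, a later sound can be checked against it. The sketch below assumes the range is stored as a start bearing and an end bearing in degrees, swept counterclockwise from start to end; this representation and the function name are assumptions for illustration and do not appear in the disclosure.

```python
def in_invalid_sector(direction_deg: float, start_deg: float, end_deg: float) -> bool:
    """True if direction_deg lies in the sector swept counterclockwise
    from start_deg to end_deg (both edges inclusive).

    The modulo arithmetic handles sectors that wrap past 0 degrees.
    """
    span = (end_deg - start_deg) % 360.0
    offset = (direction_deg - start_deg) % 360.0
    return offset <= span
```

A sound arriving from inside the sector would be treated as invalid, while a sound from outside it, such as a person speaking from the opposite side, would remain eligible for acquisition.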

図１１（ｅ）に示すように、対話ロボットＲの各マイク２０が前方から人Ｐの発話の音（図中に示す一点鎖線）を検出すると、制御部１０の検出音取得部１１は、人の発話由来の検出音を取得する。検出音方向特定部１２は、検出音の到来方向（頭部前方）を特定する。 As shown in FIG. 11 (e), when each microphone 20 of the dialogue robot R detects the sound of the person P speaking (dash-dot line in the figure) from the front, the detection sound acquisition unit 11 of the control unit 10 acquires the detected sound originating from the person's utterance. The detection sound direction specifying unit 12 specifies the arrival direction of the detected sound (front of the head).

制御部１０は、検出音方向特定部１２からの検出音の到来方向の情報に基づいて、撮像部５０を検出音の到来方向に向ける。撮像部５０は検出音の到来方向に位置する人を撮像し、撮像データを人像検出部１３に送信する。 The control unit 10 directs the imaging unit 50 to the arrival direction of the detection sound based on the information of the arrival direction of the detection sound from the detection sound direction identification unit 12. The image pickup unit 50 images a person located in the direction of arrival of the detection sound, and transmits the image pickup data to the human image detection unit 13.

人像検出部１３は、画像データから人像の存在を検出し、人検出の情報を検出音取得有効／無効判断部１４に送信する。 The human image detection unit 13 detects the existence of a human image from the image data, and transmits the human detection information to the detection sound acquisition valid / invalid determination unit 14.

対話ロボットの処理の流れについて、図１２を用いて説明する。図１２は、対話ロボットの処理の流れの一例を示すフローチャートである。 The processing flow of the interactive robot will be described with reference to FIG. FIG. 12 is a flowchart showing an example of the processing flow of the interactive robot.

図１２に示す処理フローにおいては、それぞれの検出音の方向に人が存在するか否かにより以下の処理を実施することができる。
（１）第１および第２の検出音の到来方向の両方で人が検知された場合、第１及び第２の音声到来方向の検出音に関する音声認識を不可とする。
（２）検出音の到来方向の１方向のみで人が検知された場合、人がいないと判定された到来方向からの音声取得を無効とする。
（３）２つの検出音の到来方向それぞれで、人が検知されなかった場合、各々の検出音の到来方向軸で作成される角度を確認する。
（４）各々の検出音の到来方向軸で作成される角度が１５０°以上の場合、駆動部の足部の歩行動作により、ロボット本体（音声認識装置）を移動する。
（５）各々の検出音の到来方向軸で作成される角度が１５０°以下の場合、各々の到来方向軸で囲まれた領域で音声取得を無効とする。
In the processing flow shown in FIG. 12, the following processing can be performed depending on whether or not a person is present in the direction of each detected sound.
(1) When a person is detected in both the arrival directions of the first and second detected sounds, voice recognition regarding the detected sounds in the first and second voice arrival directions is disabled.
(2) When a person is detected in only one direction of the arrival direction of the detected sound, the voice acquisition from the arrival direction determined that there is no person is invalidated.
(3) When a person is not detected in each of the two detection sound arrival directions, the angle created by the arrival direction axis of each detection sound is confirmed.
(4) When the angle created by the arrival direction axis of each detected sound is 150 ° or more, the robot body (speech recognition device) is moved by the walking motion of the foot of the driving unit.
(5) When the angle created by the arrival direction axis of each detected sound is 150 ° or less, voice acquisition is invalidated in the area surrounded by each arrival direction axis.
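Rules (1) to (5) above can be sketched as a single decision function. The names below are hypothetical. The text gives "150° or more" for rule (4) and "150° or less" for rule (5), so the two overlap at exactly 150°; this sketch assumes that exactly 150° falls under rule (4), which is a choice made here and not stated in the source.

```python
from enum import Enum


class Outcome(Enum):
    DISABLE_BOTH = "rule (1): disable recognition for both detected sounds"
    INVALIDATE_UNOCCUPIED = "rule (2): invalidate the direction with no person"
    MOVE_ROBOT = "rule (4): move the robot body"
    INVALIDATE_ENCLOSED_REGION = "rule (5): invalidate the region between the axes"


def decide_two_sounds(
    person_in_first: bool,
    person_in_second: bool,
    axis_angle_deg: float,
    threshold_deg: float = 150.0,
) -> Outcome:
    """Apply rules (1) to (5) to two detected sounds (illustrative only)."""
    if person_in_first and person_in_second:
        return Outcome.DISABLE_BOTH
    if person_in_first or person_in_second:
        return Outcome.INVALIDATE_UNOCCUPIED
    # Rule (3): no person in either direction, so check the angle.
    if axis_angle_deg >= threshold_deg:  # assumption: the tie at 150 goes to rule (4)
        return Outcome.MOVE_ROBOT
    return Outcome.INVALIDATE_ENCLOSED_REGION
```

In the FIG. 11 example, the first call would use an angle of 180° and return `MOVE_ROBOT`; after the sideways move, a smaller angle would return `INVALIDATE_ENCLOSED_REGION`.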

以下、図１２に示すフローチャートについて詳説する。 Hereinafter, the flowchart shown in FIG. 12 will be described in detail.

複数のマイク２０が音を検出するまで待機状態となる（Ｓ１１）。音が到来する（Ｓ１２）と、複数のマイク２０が音を検出し、検出音取得部１１が検出音をそれぞれ区別して取得し、検出音方向特定部１２が、検出音から音源の音の到来方向を特定し、検出音の到来方向情報を取得する場合（Ｓ１３、ＹＥＳ）、かつ、検出音の数が１つである場合（Ｓ１４、ＹＥＳ）、検出音方向特定部１２からの検出音の到来方向情報に基づいて、駆動部４０を動作し、頭部Ｒ２０の頭頂部Ｒ２０ａの撮像部５０を音の到来方向に向ける。撮像部５０は検出音の到来方向を撮像する。（Ｓ１５）。画像データは人像検出部１３に送信される。 The robot stands by until the plurality of microphones 20 detect a sound (S11). When a sound arrives (S12), the microphones 20 detect the sound, the detection sound acquisition unit 11 acquires the detected sounds separately, and the detection sound direction specifying unit 12 specifies, from the detected sound, the arrival direction of the sound from the sound source. If the arrival direction information is acquired (S13, YES) and the number of detected sounds is one (S14, YES), the drive unit 40 is operated based on the arrival direction information from the detection sound direction specifying unit 12 to direct the imaging unit 50 of the crown R20a of the head R20 toward the arrival direction of the sound. The imaging unit 50 images the arrival direction of the detected sound (S15). The image data is transmitted to the human image detection unit 13.

When the human image detection unit 13 detects a human image in the image data (S16, YES), the detected sound acquisition valid/invalid determination unit 14 determines, based on the human image detection information, that acquisition of the detected sound (human voice) from the sound generation direction is valid. The control unit 10 (detected sound acquisition unit 11) acquires the voice data as a detected sound derived from a person (S17), and the process returns to sound arrival (S12).

When the number of detected sounds is not one (S14, NO), the drive unit 40 is operated based on the arrival direction information of each detected sound from the detected sound direction specifying unit 12, and the imaging unit 50 on the crown R20a of the head R20 is directed toward each arrival direction. The imaging unit 50 images the radio device Q (detection target) in the arrival direction of one detected sound and the television device T (detection target) in the arrival direction of the other. The image data including the radio device Q and the image data including the television device T are transmitted to the human image detection unit 13 (S21).

When the human image detection unit 13 detects a human image in the image data for the first sound arrival direction (S22, YES) and also in the image data for the second sound arrival direction (S23, YES), the detected sound acquisition valid/invalid determination unit 14 determines, based on the human image detection information from the human image detection unit 13, that acquisition of both detected sounds from the first and second arrival directions is invalid. The control unit 10 (detected sound acquisition unit 11) does not acquire voice data (S24), and the process returns to sound arrival (S12) in order to acquire voice again.

Next, consider the case in S22 where the human image detection unit 13 cannot detect a human image in the image data including the radio device Q in the first detected sound arrival direction (S22, NO) and also cannot detect a human image in the image data including the television device T in the second detected sound arrival direction (S25, NO).

When the drive control unit 16 determines that the angle formed by the arrival directions of the two detected sounds is equal to or greater than a specific angle, for example 150° (S25', YES), it generates control information for controlling the moving unit 41 of the robot body R10 and transmits it to the moving unit 41. The foot portions R50 of the moving unit 41 are driven based on the control information, and the robot body R10 moves sideways by a predetermined distance (S28).

When it is determined in S25' that the angle formed by the arrival directions of the two detected sounds is less than the specific angle, for example 150° (S25', NO), the detected sound acquisition valid/invalid determination unit 14 determines, based on the information that no human image was detected, that acquisition of both detected sounds from the first and second arrival directions is invalid. The detected sound acquisition unit 11 of the control unit 10 does not acquire voice data (S27), and the process returns to sound arrival (S12) in order to acquire voice again.
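
As an illustration of the S25' check, the angle formed by two arrival directions can be computed from their azimuths. The sketch below assumes each arrival direction is available as an azimuth in degrees in the robot's horizontal plane; the function names and the azimuth representation are assumptions made for illustration and do not appear in the specification. Only the 150° example threshold is taken from the text.

```python
def angle_between(azimuth1_deg: float, azimuth2_deg: float) -> float:
    """Return the smaller angle (0 to 180 degrees) between two arrival directions."""
    diff = abs(azimuth1_deg - azimuth2_deg) % 360.0
    return 360.0 - diff if diff > 180.0 else diff


def should_move_robot(azimuth1_deg: float, azimuth2_deg: float,
                      threshold_deg: float = 150.0) -> bool:
    """Hypothetical helper for S25'.

    True  -> angle is at or above the threshold: move the robot body (S28).
    False -> angle is below the threshold: invalidate both detected sounds (S27).
    """
    return angle_between(azimuth1_deg, azimuth2_deg) >= threshold_deg
```

For example, arrival directions at 30° and 200° form an angle of 170°, so this sketch would select the move in S28, while directions at 0° and 90° would lead to S27.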

When a human image is detected in the image data for the second detected sound arrival direction (S25, YES), the detected sound acquisition valid/invalid determination unit 14 determines, based on the human image detection information, that acquisition of the detected sound from the first detected sound arrival direction is invalid. The control unit 10 (detected sound acquisition unit 11) does not acquire voice data (S26), and the process returns to sound arrival (S12).

When the human image detection unit 13 cannot detect a human image in the image data (S16, NO), or cannot detect a human image in the image data for the second detected sound arrival direction (S23, NO), the detected sound acquisition valid/invalid determination unit 14 determines, based on the information from the human image detection unit 13 that no human image was detected, that acquisition of the detected sound from that arrival direction is invalid. The control unit 10 (detected sound acquisition unit 11) does not acquire voice data (S26), and the process returns to sound arrival (S12).
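
The branches of the FIG. 12 flow described above can be summarized as a small decision function. This is a minimal sketch, not the implementation in the specification: the names `Action` and `decide`, the boolean inputs, and the return format are all assumptions for illustration. It covers only one or two detected sounds, which are the cases the text walks through, and it follows the step outcomes as the text states them (including S24, where both acquisitions are invalidated).

```python
from enum import Enum
from typing import List, Optional, Sequence, Tuple


class Action(Enum):
    ACQUIRE = "acquire"        # S17: acquire the detected sound as human voice
    DISCARD = "discard"        # S24 / S26 / S27: do not acquire voice data
    MOVE_ROBOT = "move_robot"  # S28: move the robot body sideways


def decide(person_detected: Sequence[bool],
           angle_deg: Optional[float] = None,
           threshold_deg: float = 150.0) -> Tuple[Action, List[int]]:
    """Return the action and the indices of the arrival directions
    whose acquisition is invalidated (hypothetical representation)."""
    if len(person_detected) == 1:                 # S14, YES
        if person_detected[0]:                    # S16, YES
            return Action.ACQUIRE, []             # S17
        return Action.DISCARD, [0]                # S16, NO -> S26
    if len(person_detected) != 2:
        raise ValueError("this sketch handles one or two detected sounds only")

    first, second = person_detected               # S14, NO -> S21
    if first and second:                          # S22 YES, S23 YES
        return Action.DISCARD, [0, 1]             # S24, as stated in the text
    if first != second:                           # a person in exactly one direction
        return Action.DISCARD, [1 if first else 0]  # S26: the direction with no person
    # No person in either direction (S22 NO, S25 NO): check the angle (S25').
    if angle_deg is None:
        raise ValueError("angle_deg is required when no person is detected")
    if angle_deg >= threshold_deg:
        return Action.MOVE_ROBOT, []              # S28
    return Action.DISCARD, [0, 1]                 # S27
```

Returning the invalidated direction indices separately from the action keeps the sketch neutral about what happens to the remaining direction, which the flow description does not spell out.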

With the above processing, the dialogue robot R handles detected sounds that come from a plurality of sound sources and clearly do not originate from a person as follows. When the area formed by the arrival direction axes of the detected sounds is wide, the robot body R10 moves in a direction that narrows that area, forms the area again, and invalidates voice acquisition from it. The robot can then validly acquire a person's voice detected from outside the invalidated area, which improves the recognition accuracy of speech uttered by a person.

The dialogue robot R only needs to be equipped with a drive unit 40, including a moving unit 41, that drives the imaging unit 50 and the motion sensor unit 60 in accordance with the arrival direction of the voice; it does not necessarily need the humanoid structure shown in FIG. 2. For example, the dialogue robot may instead have a cylindrical, triangular prism, cubic, rectangular parallelepiped, or spherical structure.

In the above example, the noise sources are the television device T and the radio device Q, but a noise source may be any sound source from which sound arrives, such as a fixed (installed) device like a radio cassette player, CD player, air conditioner, or fixed-line telephone.

[Modification example]
In each of the above embodiments, the control unit 10 is configured integrally with the storage unit 70, the microphones 20, the imaging unit 50, the motion sensor unit 60, and the speaker 30 in the dialogue robot R. However, the control unit 10, the storage unit 70, the microphones 20, the imaging unit 50, the motion sensor unit 60, and the speaker 30 may be separate devices. For example, at least the microphones 20, the imaging unit 50, the motion sensor unit 60, and the drive unit may be provided in one device, the control unit 10 and the storage unit 70 may be provided in an external server or the like, and the device and the server may be connected by wired or wireless communication.

For example, the dialogue robot R may include the microphones 20 and the speaker 30, and a server separate from the dialogue robot R may include the control unit 10 and the storage unit 70. In this case, the dialogue robot R may transmit the sounds detected by the microphones 20 to the server and receive from the server instructions for stopping and starting sound detection by the microphones 20 and for the output of the speaker 30.
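
One way to picture this robot/server split is as a small instruction message that the server sends and the robot applies. The sketch below is purely illustrative: the specification does not define a message format, so the JSON encoding, the `ServerInstruction` fields, and the `RobotClient` class are all hypothetical, and the network transport is omitted.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ServerInstruction:
    """Hypothetical instruction sent from the server (control unit 10) to the robot."""
    mic_detection: Optional[str] = None  # "start", "stop", or None for no change
    speaker_text: Optional[str] = None   # response to output from speaker 30, if any

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(payload: str) -> "ServerInstruction":
        return ServerInstruction(**json.loads(payload))


class RobotClient:
    """Robot-side state that applies instructions received from the server."""

    def __init__(self) -> None:
        self.mic_active = True        # microphones 20 are detecting sound
        self.spoken: List[str] = []   # stands in for output of speaker 30

    def apply(self, instruction: ServerInstruction) -> None:
        if instruction.mic_detection == "stop":
            self.mic_active = False
        elif instruction.mic_detection == "start":
            self.mic_active = True
        if instruction.speaker_text is not None:
            self.spoken.append(instruction.speaker_text)
```

In this sketch the server would serialize an instruction with `to_json()`, and the robot would decode it with `ServerInstruction.from_json()` before calling `apply()`.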

The present disclosure may also be applied to devices other than the dialogue robot R. For example, the various configurations according to the present disclosure may be realized in smartphones, home appliances, personal computers, and the like.

The dialogue robot R may also present a response by a method other than voice output. For example, information specifying a predetermined motion (a gesture or the like) of the dialogue robot R as a response may be registered in advance in the response sentence table. The control unit may then present that motion, that is, the response, to the user by controlling the motors and the like of the dialogue robot R. Alternatively, the dialogue robot R may be equipped with a display device such as a liquid crystal panel, and the response sentence may be displayed on that display device.

[Example of realization by software]
The control blocks of the control unit 10 may be realized by a logic circuit (hardware) formed in an integrated circuit (IC chip) or the like, or by software using a CPU (Central Processing Unit).

In the latter case, the control unit 10 includes a CPU that executes the instructions of a program, which is software realizing each function; a ROM (Read Only Memory) or storage device (referred to as a "recording medium") on which the program and various data are recorded so as to be readable by a computer (or CPU); a RAM (Random Access Memory) into which the program is loaded; and the like.

The object of the present invention is achieved when the computer (or CPU) reads the program from the recording medium and executes it. As the recording medium, a "non-transitory tangible medium" such as a tape, disk, card, semiconductor memory, or programmable logic circuit can be used.

The program may also be supplied to the computer via any transmission medium capable of transmitting the program (a communication network, broadcast waves, or the like).

One aspect of the present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission.

[Summary]
The voice recognition device (control unit 10) according to aspect 1 of the present invention includes: a detected sound direction specifying unit (detected sound direction specifying unit 12) that specifies the sound generation direction of a sound source from a plurality of detected sounds acquired from a plurality of microphones (microphones 20, 20, 20); a human image detection unit (human image detection unit 13) that detects a human image based on imaging data acquired by imaging the sound generation direction of the sound source and/or a sensor signal acquired by sensing the sound generation direction of the sound source; and a detected sound acquisition valid/invalid determination unit (detected sound acquisition valid/invalid determination unit 14) that, based on information acquired from the human image detection unit, determines that acquisition of the detected sound is valid when a human image can be confirmed, or invalid when a human image cannot be confirmed.

With the above configuration, the detected sound direction specifying unit specifies the sound generation direction of the sound source from the detected sounds, and based on that direction information the voice recognition device can image and/or sense the sound generation direction of the sound source. The human image detection unit detects a human image based on the imaging data and/or the sensor signal so acquired, and the detected sound acquisition valid/invalid determination unit determines, based on the information acquired from the human image detection unit, whether acquisition of the detected sound is valid or invalid. Among the detected sounds from the arrival direction, those not derived from a person are therefore not acquired while those derived from a person are, so the sound from the arrival direction can be recognized with high accuracy as speech uttered by a person.

The voice recognition device according to aspect 2 of the present invention may, in aspect 1, further include a sound source detection unit (sound source detection unit 15) that detects the presence of a sound source based on the human image information from the human image detection unit and the imaging data.

Because the sound source detection unit detects the presence of a sound source based on the human image information from the human image detection unit and the imaging data, a person and a non-human sound source in the arrival direction of the detected sound can be reliably distinguished. Detected sounds not derived from a person are not acquired while those derived from a person are, which improves the accuracy of voice recognition and prevents the malfunction of erroneously responding to non-human sounds.

The voice recognition device according to aspect 3 of the present invention may, in aspect 2, further include a drive control unit (drive control unit 16) that controls a drive unit (drive unit 40) based on information from the sound source detection unit.

With the above configuration, even when a sound source and a person are in the same direction, the drive unit can be controlled to change to a designated direction based on the sound source detection information from the sound source detection unit.

The voice recognition device according to aspect 4 of the present invention may, in aspect 1, further include a drive control unit (drive control unit 16) that controls a drive unit (moving unit 41) toward the sound generation direction of the sound source.

With the above configuration, the drive unit can be controlled to move toward the sound generation direction of the sound source.

The electronic device (dialogue robot R) according to aspect 5 of the present invention includes: a voice recognition device (control unit 10) having a detected sound direction specifying unit that specifies the sound generation direction of a sound source from a plurality of detected sounds acquired from a plurality of microphones, a human image detection unit that detects a human image based on imaging data acquired by imaging the sound generation direction of the sound source and/or a sensor signal acquired by sensing the sound generation direction of the sound source, and a detected sound acquisition valid/invalid determination unit that, based on information acquired from the human image detection unit, determines that acquisition of the detected sound is valid when a human image can be confirmed, or invalid when a human image cannot be confirmed; and a drive unit (drive unit 40) that drives an imaging unit toward the sound generation direction of the sound source.

With the above configuration, the drive unit can be operated to direct the imaging unit accurately toward the arrival direction of the sound (detected sound) from the sound source, and the same effect as the voice recognition device according to aspect 1 is obtained.

The control method for a voice recognition device according to aspect 6 of the present invention includes: a detected sound direction specifying step (S3) of specifying the sound generation direction of a sound source from a plurality of detected sounds acquired from a plurality of microphones; a human image detection step (S5) of detecting a human image based on imaging data acquired by imaging the sound generation direction of the sound source and/or a sensor signal acquired by sensing the sound generation direction of the sound source; and a detected sound acquisition valid/invalid determination step (S6, S7) of determining, based on information acquired in the human image detection step, that acquisition of a detected sound derived from a person is valid when a human image can be confirmed, or that acquisition of a detected sound not derived from a person is invalid when a human image cannot be confirmed. This processing provides the same effect as the voice recognition device according to aspect 1.

The voice recognition device according to each aspect of the present invention may be realized by a computer. In that case, a control program for the voice recognition device that realizes the device on the computer by causing the computer to operate as each unit (software element) of the device, and a computer-readable recording medium on which that program is recorded, also fall within the scope of the present invention.

The present invention is not limited to the embodiments described above. Various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.

R Dialogue robot (electronic device)
R10 Robot body
R20 Head
R20a Crown
R20b Front of head
R20c Back of head
R21 Eye
R30 Body
R40 Arm
R50 Foot
10 Control unit (voice recognition device)
11 Detected sound acquisition unit
12 Detected sound direction specifying unit
13 Human image detection unit
14 Detected sound acquisition valid/invalid determination unit
15 Sound source detection unit
16 Drive control unit
18 Detection control unit
19 Output control unit
20 Microphone
30 Speaker
40 Drive unit
41 Moving unit
50 Imaging unit
60 Motion sensor unit
70 Storage unit
P Person
Q Radio device
T Television device

Claims

A voice recognition device comprising:
a detected sound direction specifying unit that specifies the sound generation direction of a sound source from a plurality of detected sounds acquired from a plurality of microphones;
a human image detection unit that detects a human image based on imaging data acquired by imaging the sound generation direction of the sound source and/or a sensor signal acquired by sensing the sound generation direction of the sound source; and
a detected sound acquisition valid/invalid determination unit that, based on information acquired from the human image detection unit, determines that acquisition of the plurality of detected sounds is valid when a human image can be confirmed, or determines that acquisition of the plurality of detected sounds is invalid when a human image cannot be confirmed.

The voice recognition device according to claim 1, further comprising a sound source detection unit that detects the presence of the sound source based on the information acquired from the human image detection unit and the imaging data.

The voice recognition device according to claim 2, further comprising a drive control unit that controls a drive unit based on information from the sound source detection unit.

The voice recognition device according to claim 1, further comprising a drive control unit that controls a drive unit toward the sound generation direction of the sound source.

An electronic device comprising:
a plurality of microphones;
an imaging unit;
a drive unit;
a sensor; and
a control unit,
wherein the control unit specifies the sound generation direction of a sound source from a plurality of detected sounds acquired from the plurality of microphones,
detects a human image based on imaging data acquired by imaging the sound generation direction of the sound source and/or a signal from the sensor acquired by sensing the sound generation direction of the sound source, and
performs a detected sound acquisition valid/invalid determination in which, based on the information acquired from the human image detection unit, acquisition of the plurality of detected sounds is determined to be valid when a human image can be confirmed, or invalid when a human image cannot be confirmed.

A method of controlling a voice recognition device, the method comprising:
a detected sound direction specifying step of specifying the sound generation direction of a sound source from a plurality of detected sounds acquired from a plurality of microphones;
a human image detection step of detecting a human image based on imaging data acquired by imaging the sound generation direction of the sound source and/or a sensor signal acquired by sensing the sound generation direction of the sound source; and
a detected sound acquisition valid/invalid determination step of determining, based on information acquired in the human image detection step, that acquisition of the plurality of detected sounds is valid when a human image can be confirmed, or that acquisition of the plurality of detected sounds is invalid when a human image cannot be confirmed.

A control program for causing a computer to function as the voice recognition device according to claim 1, the control program causing the computer to function as the detected sound direction specifying unit, the human image detection unit, and the detected sound acquisition valid/invalid determination unit.