JP2018045192A

JP2018045192A - Voice interactive device and method of adjusting spoken sound volume

Info

Publication number: JP2018045192A
Application number: JP2016181914A
Authority: JP
Inventors: Tokuji Ikeno (池野 篤司); Muneaki Shimada (島田 宗明); Kota Hatanaka (畠中 浩太); Toshifumi Nishijima (西島 敏文); Fuminori Kataoka (片岡 史憲); Hiromi Tonegawa (刀根川 浩巳); Norihide Umeyama (梅山 倫秀)
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2016-09-16
Filing date: 2016-09-16
Publication date: 2018-03-22

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive device that can respond at an appropriate volume according to the situation. SOLUTION: A voice interactive device comprises: volume acquisition means for acquiring the input volume of a voice input from a user; distance acquisition means for acquiring the distance to the user; and output volume determination means for determining the volume of voice output to the user on the basis of the input volume acquired by the volume acquisition means and the distance acquired by the distance acquisition means. The output volume determination means stores a table in which a reference input volume and a reference output volume are associated with each distance. It compares the input volume acquired by the volume acquisition means with the reference input volume corresponding to the distance acquired by the distance acquisition means, and determines, as the output volume, a volume obtained by adjusting the reference output volume corresponding to that distance in accordance with the comparison result. SELECTED DRAWING: Figure 5

Description

The present invention relates to a voice interaction device, and more particularly to a technique for determining the utterance volume in a voice interaction device.

A voice interaction device is required to speak at an appropriate volume according to the situation. Patent Document 1 discloses measuring the distance to a user with a stereo camera and adjusting the utterance volume in consideration of that distance. Specifically, the utterance volume is increased when the user is farther away than a threshold distance and decreased when the user is closer than the threshold distance.

Patent Document 2 discloses that the distance to a sound source can be obtained from the input utterance volume (utterance sound pressure). Specifically, it discloses that the lower the utterance volume, the farther away the sound source is considered to be, so high-resolution image processing is performed, and the higher the utterance volume, the closer the sound source is considered to be, so low-resolution image processing is performed.

Patent Document 1: JP 2008-254122 A
Patent Document 2: JP 2009-136968 A

Patent Document 1 determines the utterance volume of the voice interaction device according only to the distance to the user, but it is preferable to respond at a volume that reflects not only the distance to the user but also the user's utterance volume. It is therefore conceivable to combine the technique of Patent Document 1 with the technique of Patent Document 2.

However, when a user speaks to a voice interaction device, it is natural to speak quietly when close to the device and loudly when far from it. As a result, the volume of the user utterance received by the voice interaction device stays at about the same level, and the utterance volume of the device becomes nearly constant, which is unnatural. This causes problems: the response may not reach the user when the user is far away, or the device may respond at an excessive volume when the user is very close.

An object of the present invention is to provide a voice interaction device that can respond at an appropriate volume according to the situation.

A voice interaction device according to the present invention includes: volume acquisition means for acquiring the input volume of a voice input from a user; distance acquisition means for acquiring the distance to the user; and output volume determination means for determining the volume of voice output to the user based on the input volume acquired by the volume acquisition means and the distance acquired by the distance acquisition means.

The output volume determination means in the present invention may store a table in which a reference input volume and a reference output volume are defined according to distance, compare the input volume acquired by the volume acquisition means with the reference input volume corresponding to the distance acquired by the distance acquisition means, and determine, as the output volume, a volume obtained by adjusting the reference output volume corresponding to that distance according to the comparison result. Here, it is preferable that, as the distance increases, the reference input volume in the table becomes smaller (monotonically decreasing in the broad sense, that is, non-increasing) and the reference output volume becomes larger (monotonically increasing in the broad sense, that is, non-decreasing).

In general, it is preferable to respond at a higher volume the farther away the user is and at a lower volume the closer the user is. It is also preferable to respond at a volume that corresponds to the user's utterance volume. With the above configuration, the response volume (sound pressure) is determined by taking into account both the distance to the user and the input volume of the voice received from the user, so a response at an appropriate volume is possible.

The distance acquisition means in the present invention may include image acquisition means for acquiring an image of the user, detection means for detecting the user's face or body in the image, and distance detection means for obtaining the distance to the user from the face or body detected by the detection means. The distance acquisition means may also be a distance sensor. The distance sensor may use a laser, ultrasonic waves, infrared light, or the like, or may use a stereo method or a DFD (Depth from Defocus) method.

Note that the present invention can be understood as a voice interaction device including at least some of the above means. The present invention can also be understood as a voice interaction method or an utterance volume adjustment method that executes at least part of the above processing. The present invention can further be understood as a computer program for causing a computer to execute this method, or as a computer-readable storage medium that stores this computer program non-transitorily. The above means and processes can be combined with one another to the extent possible to constitute the present invention.

According to the present invention, a voice interaction device can respond at an appropriate volume according to the situation.

FIG. 1 is a diagram illustrating the system configuration of a voice interaction system according to an embodiment.
FIG. 2 is a diagram illustrating the functional configuration of the voice interaction system according to the embodiment.
FIG. 3 is a diagram illustrating an example of the flow of dialogue processing in the voice interaction system according to the embodiment.
FIG. 4 is a diagram illustrating another example of the flow of dialogue processing in the voice interaction system according to the embodiment.
FIG. 5(A) shows the relationship between distance and reference input volume, FIG. 5(B) shows the relationship between distance and reference output volume, and FIG. 5(C) is a flowchart showing the flow of the utterance volume control process.

Preferred embodiments of the present invention are described in detail below, by way of example, with reference to the drawings. The embodiment described below is a system that uses a voice interactive robot as the local voice interaction terminal. However, the local voice interaction terminal does not need to be a robot; any information processing apparatus, voice interaction interface, or the like can be used.

<System configuration>
FIG. 1 is a diagram showing the system configuration of the voice interaction system according to the present embodiment, and FIG. 2 is a diagram showing its functional configuration. As shown in FIGS. 1 and 2, the voice interaction system according to the present embodiment consists of a robot 100, a smartphone 110, a voice recognition server 200, and a dialogue server 300.

The robot (voice interactive robot) 100 includes a voice input unit (microphone) 101, an image input unit (camera) 102, a voice output unit (speaker) 103, a command transmission/reception unit 104, and a communication unit (BT: Bluetooth (registered trademark)) 105. Although not illustrated, the robot 100 also has movable joints (face, arms, legs, etc.), a drive control unit for the movable joints, various lights, a control unit for turning the lights on and off, and the like.

The robot 100 acquires the user's voice with the voice input unit 101 and acquires an image of the user with the image input unit 102. The robot 100 transmits the input voice and the input image to the smartphone 110 via the communication unit 105. When the robot 100 receives a command from the smartphone 110, it outputs voice from the voice output unit 103 or drives the movable joints accordingly.

The smartphone 110 is a computer including an arithmetic device such as a microprocessor, a storage unit such as a memory, an input/output device such as a touch screen, a communication device, and the like. When the microprocessor executes a program, the smartphone 110 provides an input voice processing unit 111, a voice synthesis processing unit 112, a command transmission/reception unit 113, a position information processing unit 114, a simple response creation unit 115, a control unit 116, a communication unit (BT) 117, and a communication unit (TCP/IP) 118.

The input voice processing unit 111 receives voice data from the robot 100 and transmits it to the voice recognition server 200 via the communication unit 118 to request voice recognition processing. The input voice processing unit 111 may perform some preprocessing before requesting voice recognition processing from the voice recognition server 200. The input voice processing unit 111 transmits the voice recognition result from the voice recognition server 200 to the dialogue server 300 via the communication unit 118 and requests generation of the text of a response sentence (the sentence the robot 100 is to utter) in reply to the user utterance.

The voice synthesis processing unit 112 acquires the text of the response sentence and performs voice synthesis processing to generate the voice data that the robot 100 is to utter.

The position information processing unit 114 keeps a history of position information measured by GPS and of date and time information.

In the interval between receiving a user utterance from the robot 100 and returning the response produced by the voice recognition server 200 and the dialogue server 300, the simple response creation unit 115 creates a simple response, such as a back-channel cue, a short reply, or a repetition of the input voice data, and has the robot 100 output it.

The control unit 116 governs the overall processing of the smartphone 110. The communication unit 117 communicates with the robot 100 in accordance with the Bluetooth (registered trademark) standard. The communication unit 118 communicates with the voice recognition server 200 and the dialogue server 300 in accordance with the TCP/IP standard.

The voice recognition server 200 is a computer including an arithmetic device such as a microprocessor, a memory, a communication device, and the like, and includes a communication unit 201 and a voice recognition processing unit 202. The voice recognition server 200 has abundant resources and is capable of highly accurate voice recognition.

The dialogue server 300 is a computer including an arithmetic device such as a microprocessor, a memory, a communication device, and the like, and includes a communication unit 301, a response creation unit 302, and an information storage unit 303. The information storage unit 303 stores dialogue scenarios for creating responses. The response creation unit 302 creates a response to the user utterance by referring to the dialogue scenarios in the information storage unit 303. The dialogue server 300 has abundant resources (such as a high-speed computing unit and a large-capacity dialogue scenario DB) and can generate sophisticated responses.

<Overall processing>
The overall processing flow in the voice interaction system according to the present embodiment is described with reference to FIG. 3.

In step S11, when the robot 100 receives the voice of a user utterance through the voice input unit 101, the robot 100 transmits the input voice data to the input voice processing unit 111 of the smartphone 110 via the communication unit 105, and the input voice processing unit 111 transmits the input voice data to the voice recognition server 200.

In step S12, the voice recognition processing unit 202 of the voice recognition server 200 performs voice recognition processing.

In step S13, the input voice processing unit 111 of the smartphone 110 acquires the recognition result from the voice recognition server 200, and the position information processing unit 114 acquires position information from GPS. The input voice processing unit 111 transmits the voice recognition result and the position information to the dialogue server 300 and requests creation of a response sentence. Here the voice recognition result is sent from the voice recognition server 200 to the dialogue server 300 via the smartphone 110, but it may instead be sent directly from the voice recognition server 200 to the dialogue server 300.

In step S14, the response creation unit 302 of the dialogue server 300 generates the text of a response to the voice recognition result, referring to the dialogue scenarios stored in the information storage unit 303. The response text generated by the dialogue server 300 is transmitted to the smartphone 110.

In step S15, the dialogue server 300 stores the content of the received user utterance in the information storage unit 303, and in step S16 it stores the received position information in the information storage unit 303. Storing, for each user, where and what was uttered in the information storage unit 303 makes this information available for creating future response sentences.

In step S17, when the smartphone 110 receives the response text from the dialogue server 300, the voice synthesis processing unit 112 generates voice data for the response text by voice synthesis processing. The command transmission/reception unit 113 transmits a command to the robot 100 instructing it to output the voice data.

In step S18, the command transmission/reception unit 104 of the robot 100 receives the command from the smartphone 110, and the response voice data is output from the voice output unit 103.

The above processing takes a certain amount of time, and if the robot 100 says nothing in the meantime, the dialogue with the user drags unnaturally. Therefore, while the above processing is in progress, the simple response creation unit 115 of the smartphone 110 creates voice data for a simple response, such as a back-channel cue, a short reply, or a repetition of the input voice data, and transmits it to the robot 100 so that the robot utters the response. The position information of the smartphone 110 can also be used in the dialogue to broaden its range.

<Utterance volume adjustment method>
FIG. 4 is a flowchart explaining the utterance volume adjustment (determination) method performed when the robot 100 responds in the voice interaction system according to the present embodiment.

When the user speaks toward the robot 100, the voice input unit (microphone) 101 acquires the user's voice (S21) and the image input unit (camera) 102 acquires an image of the user. The direction in which the user is located may be determined from the difference in arrival time of sound waves at a microphone array, or by other techniques.
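The arrival-time approach mentioned above can be sketched as follows. This is only an illustration under stated assumptions: a two-microphone array, a far-field source, and a fixed speed of sound. The function name `arrival_angle_deg` and its parameters are hypothetical and do not come from the publication.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air (about 20 degrees C)


def arrival_angle_deg(time_diff_s: float, mic_spacing_m: float) -> float:
    """Estimate the bearing of a far-field source from a two-microphone delay.

    time_diff_s is the arrival-time difference between the two microphones;
    the result is the angle from the broadside direction in degrees
    (0 means the source is straight ahead of the microphone pair).
    """
    if mic_spacing_m <= 0:
        raise ValueError("mic_spacing_m must be positive")
    # Path-length difference divided by spacing gives sin(angle).
    ratio = SPEED_OF_SOUND_M_S * time_diff_s / mic_spacing_m
    # Clamp to the valid asin domain to absorb measurement noise.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A zero delay corresponds to a source directly in front of the pair, and the largest physically possible delay (spacing divided by the speed of sound) corresponds to a source in line with the two microphones.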

The voice data of the user utterance is sent to the smartphone 110, and the smartphone 110 carries out voice recognition processing and response creation processing on the voice data (S22). This processing was described with reference to FIG. 3 and is not repeated here.

In step S23, the voice input unit 101 of the robot 100 detects the volume (sound pressure) of the user's voice.
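The publication does not say how the volume is measured. One common choice, shown here purely as an assumption, is the root-mean-square (RMS) level of the captured samples expressed in decibels relative to a reference amplitude. The name `rms_level_db` and the default reference of 1.0 (full scale) are hypothetical.

```python
import math
from typing import Sequence


def rms_level_db(samples: Sequence[float], reference: float = 1.0) -> float:
    """Return the RMS level of the samples in dB relative to `reference`.

    Assumes samples are normalized amplitudes (for example in [-1.0, 1.0]).
    Silence is reported as negative infinity.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    rms = math.sqrt(mean_square)
    if rms == 0.0:
        return float("-inf")
    return 20.0 * math.log10(rms / reference)
```

Converting this relative level into an absolute sound pressure level would additionally require a microphone calibration, which is outside the scope of this sketch.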

The image input unit 102 of the robot 100 detects the user's face in the image in step S25 and acquires the size of the face image in step S26. Here the face is extracted from the captured image, but the body may be extracted instead.

The volume of the user utterance acquired in step S23 and the size of the face image acquired in step S26 are transmitted from the robot 100 to the smartphone 110. In step S27, the control unit 116 of the smartphone 110 calculates the position of the user, that is, the distance between the robot 100 and the user, based on the volume and the face size.
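The publication does not give the formula used in step S27. As a rough sketch of the face-size part only, a pinhole camera model relates the apparent face width in pixels to distance. The function name, the focal length parameter, and the assumed real face width of 0.16 m are all illustrative assumptions, and the sketch ignores the volume input that the embodiment also uses.

```python
def distance_from_face_width(
    face_width_px: float,
    focal_length_px: float,
    real_face_width_m: float = 0.16,  # assumed typical adult face width
) -> float:
    """Estimate camera-to-face distance in meters with a pinhole camera model.

    distance = focal_length * real_width / apparent_width
    """
    if face_width_px <= 0:
        raise ValueError("face_width_px must be positive")
    return focal_length_px * real_face_width_m / face_width_px
```

Under this model, a face that appears half as wide in the image is estimated to be twice as far away.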

In step S28, the control unit 116 determines the utterance volume at which the robot 100 responds to the user utterance, based on the volume of the user utterance and the distance to the user. The details of this determination process are described with reference to FIGS. 5(A) to 5(C).

FIG. 5(A) is a table, stored in advance in the smartphone 110, that represents the relationship between the user-to-robot distance and the reference input volume. The reference input volume is the volume expected to be input to the robot 100 when the user speaks at a normal volume. Users generally tend to speak louder the farther they are from the robot, but the input volume is still expected to decrease as the distance increases. The reference input volume is therefore set so that it becomes smaller as the distance between the user and the robot becomes larger.

FIG. 5(B) is a table, stored in advance in the smartphone 110, that represents the relationship between the user-to-robot distance and the reference output volume. The reference output volume is the volume that serves as the baseline when the robot 100 speaks. The farther apart the user and the robot are, the louder the robot 100 needs to speak. The reference output volume is therefore set so that it becomes larger as the distance between the user and the robot becomes larger.

FIGS. 5(A) and 5(B) show the reference input volume and the reference output volume changing linearly with distance, but these figures merely illustrate the relationship between distance and volume, and the relationship need not be linear. Likewise, the figures show the reference volumes as strictly monotonically decreasing or increasing functions of distance, but they may be monotonic in the broad sense (non-increasing or non-decreasing). For example, a step function may be used.
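The two table shapes mentioned above (a linear relationship and a step function) can be sketched as lookups over a small calibration table. The table values, the decibel units, and the function names `lookup_linear` and `lookup_step` are hypothetical; the publication gives no concrete numbers.

```python
from bisect import bisect_right

# Hypothetical calibration points: (distance in meters, volume in dB).
# Reference input volume falls with distance; reference output volume rises.
REFERENCE_INPUT_DB = [(0.5, 70.0), (1.0, 65.0), (2.0, 60.0), (4.0, 55.0)]
REFERENCE_OUTPUT_DB = [(0.5, 55.0), (1.0, 60.0), (2.0, 65.0), (4.0, 70.0)]


def lookup_linear(table, distance_m):
    """Piecewise-linear interpolation, clamped at both ends of the table."""
    distances = [d for d, _ in table]
    if distance_m <= distances[0]:
        return table[0][1]
    if distance_m >= distances[-1]:
        return table[-1][1]
    i = bisect_right(distances, distance_m)
    (d0, v0), (d1, v1) = table[i - 1], table[i]
    return v0 + (v1 - v0) * (distance_m - d0) / (d1 - d0)


def lookup_step(table, distance_m):
    """Step function: use the value of the nearest calibration point at or below."""
    distances = [d for d, _ in table]
    i = bisect_right(distances, distance_m) - 1
    return table[max(i, 0)][1]
```

With these sample tables, a distance of 1.5 m gives 62.5 dB for both references under linear interpolation, while the step lookup returns the value stored for 1.0 m.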

FIG. 5(C) is a flowchart showing the detailed flow of the utterance volume control process in step S28. In step S31, the control unit 116 acquires the distance to the user. In step S32, the control unit 116 refers to the tables of FIGS. 5(A) and 5(B) and acquires the reference input volume and the reference output volume corresponding to the distance.

In step S33, the control unit 116 compares the input volume with the reference input volume. If the input volume is lower than the reference input volume, the process proceeds to step S34, and the control unit 116 sets the output volume lower than the reference output volume. If the input volume is about the same as the reference input volume, the process proceeds to step S35, and the control unit 116 uses the reference output volume as the output volume. If the input volume is higher than the reference input volume, the process proceeds to step S36, and the control unit 116 sets the output volume higher than the reference output volume.

Several methods are conceivable for determining the output volume in steps S34 and S36 from the result of comparing the input volume with the reference input volume. For example, the output volume can be the reference output volume increased or decreased according to the difference or ratio between the input volume and the reference input volume. Alternatively, the difference or ratio between the input volume and the reference input volume can be classified into levels according to a predetermined criterion, and the output volume can be the reference output volume increased or decreased according to that level.
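The two adjustment strategies described above can be sketched as follows, taking the reference volumes already looked up for the measured distance as inputs. The tolerance band used for "about the same", the gain, the step size, and both function names are assumptions made for illustration; the publication leaves these details open.

```python
def decide_output_volume(
    input_db: float,
    ref_input_db: float,
    ref_output_db: float,
    tolerance_db: float = 3.0,  # assumed band treated as "about the same"
    gain: float = 0.5,          # assumed fraction of the difference applied
) -> float:
    """Adjust the reference output volume in proportion to the difference."""
    diff = input_db - ref_input_db
    if abs(diff) <= tolerance_db:
        return ref_output_db           # corresponds to S35
    # diff < 0 lowers the output (S34); diff > 0 raises it (S36).
    return ref_output_db + gain * diff


def decide_output_volume_leveled(
    input_db: float,
    ref_input_db: float,
    ref_output_db: float,
    tolerance_db: float = 3.0,  # assumed width of one level
    step_db: float = 3.0,       # assumed output change per level
) -> float:
    """Classify the difference into discrete levels, then adjust per level."""
    diff = input_db - ref_input_db
    if abs(diff) <= tolerance_db:
        return ref_output_db
    level = int(diff / tolerance_db)   # truncates toward zero
    return ref_output_db + level * step_db
```

For instance, with a reference input of 60 dB and a reference output of 65 dB, an input of 50 dB yields 60 dB from the proportional variant and 56 dB from the leveled variant, so a quiet speaker at a given distance receives a quieter reply in both cases.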

Returning to the description of FIG. 4: once the utterance volume has been determined and the response sentence has been acquired, the smartphone 110 generates voice data with the voice synthesis processing unit 112 and transmits a command to the robot 100 instructing it to output the voice data at the determined utterance volume. On receiving this command, the robot 100 outputs the specified response at the specified output volume from the voice output unit (speaker) 103.

Here the volume of the robot's response output is adjusted based on the distance to the user and the utterance volume, but the amount of motion of the robot 100 in response to the user utterance (the amount of movement of the head, arms, and so on) may be adjusted as well. For example, in the same way as the output volume is determined, the amount of motion of the robot 100 could be made larger the greater the distance to the user or the louder the user's utterance.

<Advantageous effects of this embodiment>
In the voice interaction system according to the present embodiment, the volume of the robot's response is determined in consideration of both the distance to the user and the volume of the user utterance, so the determination fits the situation better than one based on only one of the two.

<Modifications>
In the above description, the distance between the robot 100 and the user is obtained from the size of the user's face or body in the captured image, but the method of obtaining the distance is not limited to this, and any distance sensor may be used. Distance sensors using a laser, ultrasonic waves, infrared light, or the like can be employed. Stereo ranging, DFD ranging, and the like can also be employed as image-based distance detection methods.

The above description also shows an example in which the robot serving as the interface with the user, the smartphone 110 that performs processing such as voice recognition and response creation, the voice recognition server 200, and the dialogue server 300 are separate devices, but the system configuration is not limited to this. For example, the functions of the robot 100 and the smartphone 110 may be implemented in a single device, and the functions of the voice recognition server 200 and the dialogue server 300 may be included in that device as well. Alternatively, the voice recognition server 200 and the dialogue server 300 may be realized as a single server separate from the robot and the smartphone.

<Others>
The configurations of the above embodiment and modifications can be used in appropriate combinations without departing from the technical idea of the present invention. The present invention may also be implemented with appropriate changes without departing from its technical idea.

100: Robot
101: Voice input unit
102: Image input unit
103: Voice output unit
104: Command transmission/reception unit
105: Communication unit (BT)
110: Smartphone
111: Input voice processing unit
112: Voice synthesis processing unit
113: Command transmission/reception unit
114: Position information processing unit
115: Simple response creation unit
116: Control unit
117: Communication unit (BT)
118: Communication unit (TCP/IP)
200: Voice recognition server
201: Communication unit (TCP/IP)
202: Voice recognition processing unit
300: Dialogue server
301: Communication unit (TCP/IP)
302: Response creation unit
303: Information storage unit

Claims

Volume acquisition means for acquiring the input volume of the voice input from the user;
Distance acquisition means for acquiring a distance to the user;
Output volume determination means for determining a sound output volume for the user based on the input volume acquired by the volume acquisition means and the distance acquired by the distance acquisition means;
A voice interaction device comprising:

The output volume determination means stores a table in which a reference input volume and a reference output volume are defined according to distance, compares the input volume acquired by the volume acquisition means with the reference input volume corresponding to the distance acquired by the distance acquisition means, and determines, as the output volume, a volume obtained by adjusting the reference output volume corresponding to the distance acquired by the distance acquisition means according to the comparison result,
wherein the larger the distance, the smaller the reference input volume and the larger the reference output volume.
The voice interactive apparatus according to claim 1.

The distance acquisition means includes
Image acquisition means for acquiring a user image;
Detecting means for detecting a user's face or body from the image;
Distance detecting means for obtaining the distance to the user from the user's face or body detected by the detecting means;
The voice interactive apparatus according to claim 1, comprising:

The distance acquisition means is a distance sensor.
The voice interaction apparatus according to claim 1 or 2.

A speech volume adjustment method performed by a voice interaction device,
A volume acquisition step for acquiring the input volume of the voice input from the user;
A distance acquisition step of acquiring a distance to the user;
An output volume determination step for determining a volume of audio output for the user based on the input volume acquired in the volume acquisition step and the distance acquired in the distance acquisition step;
Utterance volume adjustment method.

The program for making a computer perform each step of the method of Claim 5.