JP5206151B2

JP5206151B2 - Voice input robot, remote conference support system, and remote conference support method

Info

Publication number: JP5206151B2
Application number: JP2008165286A
Authority: JP
Inventors: 寛之福島
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2008-06-25
Filing date: 2008-06-25
Publication date: 2013-06-12
Anticipated expiration: 2028-06-25
Also published as: JP2010010857A

Abstract

PROBLEM TO BE SOLVED: To obtain a voice input robot that effectively supports voice communication in which a plurality of persons participate.
SOLUTION: The robot includes: a voice input unit 111 which receives voice input; a sound source position estimation unit 121 which estimates the sound source position of the voice received by the voice input unit 111; and an operation unit 112 that changes the position of the voice input unit 111. The sound source position estimation unit 121 estimates the sound source positions of a plurality of voices received by the voice input unit 111, and the operation unit 112 changes the positional relationship between the voice input unit 111 and the sound source positions of the plurality of voices, based on the estimation results of the sound source position estimation unit 121.
COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a voice input robot having a voice input unit, a remote conference support system having the robot, and a remote conference support method using the robot.

Conventionally, regarding robot apparatuses, a technique has been proposed whose stated purpose is "to provide a robot apparatus, and a behavior control method for the robot apparatus, that can act more naturally toward an object and offer improved entertainment value." In this technique, "the robot apparatus 1 includes a CCD camera 22, a microphone 24, a moving object detection module 32 that detects a moving object from image data, a face detection module 33 that detects a human face, a sound source direction estimation module 34 that estimates a sound source direction from audio data, and control means that controls the robot to move in any one of the moving object direction based on the moving object detection result, the face direction based on the face detection result, and the estimated sound source direction. When a face is detected while the robot is walking in the moving object direction or the estimated sound source direction, the control means controls the robot to move in the face direction, and stops the walking when the robot has come within a predetermined range of the object that is the face detection target" (Patent Document 1).

Regarding autonomous behavior robots, a technique has been proposed whose stated purpose is "to provide a behavior control device for an autonomous behavior robot that responds to humans with pet-like behavior so that humans feel a sense of closeness." The device is "composed of an image input device 1 using a stereo camera; a person detection device 2 that detects a person by image processing and tracks the person's face region; a distance calculation device 3 that calculates distance from the stereo camera images; a person identification device 4 that identifies a person from information in a person information storage unit 5; a voice input device 6 consisting of a microphone attached to the body; a sound source direction detection device 7; a voice recognition device 8; ultrasonic sensors 9 installed on the front, rear, left, and right of the robot, which send obstacle information to an obstacle detection device 10; a touch sensor 11 that sends the behavior control device 12 a signal distinguishing being stroked from being hit; a leg motor 13 driving two wheels; a head motor 14 that rotates the head; and a voice output device 15 attached to the robot's mouth" (Patent Document 2).

Regarding interactive robots, a technique has been proposed whose stated purpose is "to provide an interactive robot that can improve speech recognition accuracy without increasing the operational burden on the person interacting with it." The technique is "an interactive robot 400 capable of speech recognition, comprising: sound source direction estimation means for estimating the sound source direction of target speech to be recognized; moving means for moving the interactive robot itself in the sound source direction estimated by the sound source direction estimation means; target speech acquisition means for acquiring the target speech at the position reached after movement by the moving means; and speech recognition means for performing speech recognition on the target speech acquired by the target speech acquisition means" (Patent Document 3).

Patent Document 1: JP 2004-130427 A (abstract)
Patent Document 2: JP 2003-326479 A (abstract)
Patent Document 3: JP 2006-181651 A (abstract)

In an environment where voice conversation takes place through a microphone, as in remote communication, a speaker's voice may be hard to hear depending on the positional relationship between the speaker and the microphone.
In particular, when there are several speakers, how easily each one can be heard differs from speaker to speaker because of differences in speaking volume and in distance and position relative to the microphone.

In such a situation, the person listening to the sound collected by the microphone (in remote communication, the party at the remote site) tries to improve matters by making requests to the speaker such as "I can't hear you well" or "Please speak a little closer to the microphone."
However, such exchanges interrupt the utterance, hinder the smooth progress of communication, and place unnecessary stress on the participants.

One conceivable remedy is to raise the performance of the microphones that collect the voice or to install more of them, but preparing such an environment is costly.

Meanwhile, the techniques described in Patent Documents 1 to 3 disclose estimating a sound source position from acquired sound and having the robot move in that direction. This can be regarded as an attempt to capture voice at a position close to the speaker.
However, this behavior is intended for dialogue between a human and a robot, not for making remote communication proceed smoothly.

For example, the techniques described in Patent Documents 1 to 3 could be used to optimize the distance between the robot and the person it is conversing with by having the robot move.
However, in an environment where several people take part in the communication, such as a remote conference, optimizing only the two-party relationship between the robot and its conversation partner does not necessarily optimize the progress of the conference as a whole.
That is, the techniques described in Patent Documents 1 to 3 are not necessarily suited to an environment in which several people participate in a conference, in other words, an environment in which voices from a plurality of sound sources must be collected as a whole.

Therefore, a voice input robot that can effectively support voice communication in which a plurality of people participate has been desired.

The voice input robot according to the present invention includes a voice input unit that receives voice input, a sound source position estimation unit that estimates the sound source position of the voice received by the voice input unit, and an operation unit that changes the position of the voice input unit. The sound source position estimation unit estimates the sound source positions of a plurality of voices received by the voice input unit. Based on the estimation results of the sound source position estimation unit, the operation unit moves the voice input unit in a direction in which the collected volumes of the voices received by the voice input unit become equal, thereby changing the positional relationship between the voice input unit and the sound source positions of the plurality of voices.

According to the voice input robot of the present invention, voices arising from a plurality of sound source positions can be collected as a whole, so voice communication in which a plurality of people participate can be effectively supported.

Embodiment 1.
FIG. 1 is a configuration diagram of the remote conference support system according to Embodiment 1 of the present invention.
The remote conference support system according to Embodiment 1 includes a voice input robot 100 and a conference terminal 200. The voice input robot 100 and the conference terminal 200 are remotely connected via a network 300 such as a LAN (Local Area Network) or the Internet.

The voice input robot 100 includes a robot main body 110 and a robot control unit 120.
The robot main body 110 includes the main body casing of the voice input robot 100 and the components attached to that casing. Its specific configuration is described later.
The robot control unit 120 controls the operation of the voice input robot 100. Its specific configuration is described later. The robot control unit 120 and each of its components can be implemented in hardware, such as circuit devices that realize their functions, or as an arithmetic device such as a microcomputer or CPU (Central Processing Unit) together with software that defines its operation. The necessary storage devices and network interfaces are provided as appropriate.

The robot main body 110 and the robot control unit 120 may be built into the same casing. Alternatively, the robot control unit 120 may, for example, be separated from the robot main body 110 and placed externally, with the two communicating with each other by wire or wirelessly.

The robot main body 110 includes a voice input unit 111 and an operation unit 112.

The voice input unit 111 is composed of, for example, a microphone array having a plurality of microphones, and collects the voices around the voice input robot 100.
To allow the voice input robot 100 to collect voices from all directions without changing its posture, it is preferable to configure the voice input unit 111 as a microphone array. One conceivable arrangement is to place a plurality of unidirectional microphones on a circle with their directional axes pointing outward.
The voice collected by the voice input unit 111 is output to the voice information processing unit 121 described later.
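The circular arrangement mentioned above can be illustrated with a minimal sketch. The function name `circular_array`, the even angular spacing, and the example radius are assumptions made for illustration; the description does not specify the number of microphones or their spacing.

```python
import math
from typing import List, Tuple


def circular_array(n_mics: int, radius: float) -> List[Tuple[float, float, float]]:
    """Return (x, y, heading) for each microphone; heading is in radians.

    Hypothetical layout: microphones evenly spaced on a circle, each one
    pointing radially outward from the center of the circle.
    """
    if n_mics < 1:
        raise ValueError("need at least one microphone")
    mics = []
    for k in range(n_mics):
        angle = 2.0 * math.pi * k / n_mics
        # Position on the circle; the directional axis equals the radial angle.
        mics.append((radius * math.cos(angle), radius * math.sin(angle), angle))
    return mics


# Example: four microphones on a 5 cm circle.
layout = circular_array(4, 0.05)
```

With evenly spaced outward-facing elements, every direction around the robot is covered by at least one microphone without the robot turning.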

The operation unit 112 has the function of changing the spatial position of the voice input unit 111, within the space where the voice input robot 100 is located, based on instructions from the operation determination unit 123. Specific configuration examples of the operation unit 112 are described later with reference to FIG. 2.

The robot control unit 120 includes a voice information processing unit 121, a statistical processing unit 122, an operation determination unit 123, a database 124, and a setting unit 125.

The voice information processing unit 121 receives the voice collected by the voice input unit 111, estimates the sound source position of that voice, and calculates the volume of the estimated sound source. The estimation and calculation results are stored in the database 124. Any suitable method, such as a known technique, may be used to estimate the sound source position.
The voice information processing unit 121 also transmits the voice received from the voice input unit 111 to the conference terminal 200 via the network 300.

The statistical processing unit 122 performs the statistical processing described later with reference to FIGS. 3 to 5, using the data accumulated in the database 124 and the setting information received by the setting unit 125, and creates a voice distribution map by mapping the voice environment of the space where the voice input robot 100 is located. The created map is stored in the database 124.
The statistical processing is applied to the information processed by the voice information processing unit 121, namely the estimated sound source positions, the volume at each estimated sound source position, the time (sampling time), and so on.

The operation determination unit 123 decides, from the voice distribution map created by the statistical processing unit 122 and the setting information received by the setting unit 125, whether to change the spatial position of the voice input unit 111 and, if so, the destination position. The decision is output to the operation unit 112 as a position-change command.

The database 124 holds, in chronological order, the information processed by the voice information processing unit 121, namely the estimated sound source positions, the volume at each estimated sound source position, and so on. The database 124 can be configured using a storage device such as an HDD (Hard Disk Drive). The information may be stored in any format.
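As a minimal sketch of how such time-ordered records might be held, the following uses an in-memory list. The record fields, the class names `SoundObservation` and `ObservationStore`, and the units are assumptions for illustration only; the description leaves the storage format open.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SoundObservation:
    """One estimate from a sampling window (hypothetical record layout)."""
    timestamp: float               # sampling time, in seconds
    position: Tuple[float, float]  # estimated source position (x, y), in meters
    volume: float                  # volume calculated for that source


class ObservationStore:
    """Keeps observations ordered by timestamp (an in-memory stand-in)."""

    def __init__(self) -> None:
        self._rows: List[SoundObservation] = []

    def add(self, obs: SoundObservation) -> None:
        self._rows.append(obs)
        # Re-sort so that reads are always in chronological order.
        self._rows.sort(key=lambda o: o.timestamp)

    def between(self, start: float, end: float) -> List[SoundObservation]:
        """Return observations whose timestamp lies in [start, end]."""
        return [o for o in self._rows if start <= o.timestamp <= end]
```

A query such as `store.between(t0, t1)` would give the statistical processing step the observations for one sampling window.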

The setting unit 125 accepts input of setting information that specifies the voice environment and voice collection conditions desired by the listener, that is, how the listener wants to hear the voices from the speakers' side. Specific examples of the settings are given later.
Concretely, the setting information may be accepted via, for example, a network interface or on-screen input.
The content of the position-change command output by the operation determination unit 123 is determined from the setting information received by the setting unit 125 and the voice distribution map created by the statistical processing unit 122.

The "voice environment desired by the listener" accepted by the setting unit 125 means, for example, preferences such as (1) to (3) below.

(1) The listener wants to hear the utterances from a position roughly equidistant from every speaker.
In this case, voices from a plurality of sound sources are acquired simultaneously. Because the collected volume depends strongly on each speaker's own speaking volume, it is easy to read a speaker's emotion from how loudly they speak.

(2) The listener wants to hear all speakers' utterances at the same volume.
In this case too, voices from a plurality of sound sources are acquired simultaneously. The perceived strength of an utterance or assertion is less affected by differences in volume.

(3) The listener wants to hear a specific speaker's utterances under conditions that make them easy to hear.
This corresponds to a situation in which one speaker talks a great deal, for example when a speaker is explaining materials.
In this case, attention is paid to the position of the speaker who speaks most frequently, and the listener is assumed to want the volume lowered when that speaker's voice is too loud and raised when it is too quiet.

Because the sound collection conditions can be changed in this way through the setting unit 125, the voice input robot 100 can effectively support communication between humans.
In this respect it differs from techniques such as the interactive robots described in Patent Documents 1 to 3, which presuppose communication between a human and a robot and act only according to a fixed, programmed purpose.

In Embodiment 1, the voice information processing unit 121 corresponds to the "sound source position estimation unit".
The operation unit 112, together with the operation determination unit 123 that determines its operation, corresponds to the "operation unit".

The conference terminal 200 is a terminal used by a remote conference participant and can be configured using a computer such as a notebook PC. It also includes a voice output unit 210 composed of, for example, a speaker.
The conference terminal 200 receives, via the network 300, the voice transmitted by the robot control unit 120 and outputs it from the voice output unit 210. By listening to this output, the remote conference participant can hear the voices of the conference participants around the voice input robot 100.

The configuration of the remote conference support system according to Embodiment 1 has been described above.
Next, specific configuration examples of the operation unit 112 are described.

FIG. 2 shows examples of the external configuration of the voice input robot 100. FIG. 2(a) is a self-propelled configuration, and FIG. 2(b) is a fixed movable configuration.

In the self-propelled configuration shown in FIG. 2(a), the operation unit 112 is a vehicle that can move in any direction on a plane, and a plurality of voice input units 111, each consisting of a microphone, are installed on the vehicle's pedestal.
The vehicle-type operation unit 112 drives its wheels based on instructions from the operation determination unit 123 and moves the voice input robot 100 in the instructed direction.

In the fixed movable configuration shown in FIG. 2(b), the operation unit 112 is a movable swing arm fixed to a bottom pedestal, and a plurality of voice input units 111, each consisting of a microphone, are installed on a pedestal fixed to the top of the swing arm.
The swing-arm-type operation unit 112 moves the spatial position of the voice input unit 111 by changing the arm's posture (yaw and pitch angles) and length based on instructions from the operation determination unit 123.

Specific configuration examples of the operation unit 112 have been described above.
Next, examples of the voice distribution map created by the statistical processing unit 122 are described in relation to the "voice environment desired by the listener" that is input to the setting unit 125 as described earlier.

FIG. 3 is an example of a voice distribution map created on the basis of sound source positions only. The process of changing the spatial position of the voice input unit 111 is described below with reference to FIG. 3.
Here it is assumed that "(1) The listener wants to hear the utterances from a position roughly equidistant from every speaker" has been input to the setting unit 125 as the setting information.

FIG. 3(a) shows the initial state of the conference participants and the voice input robot 100. In the figure, 1 to 3 indicate the positions of the conference participants, and the black triangle indicates the initial position of the voice input robot 100.
In the state of FIG. 3(a), the voice input robot 100 is closest to conference participant 2 and far from the other conference participants.

Within a predetermined sampling time, the voice information processing unit 121 receives the utterances of conference participants 1 to 3 from the voice input unit 111, estimates the sound source position of each conference participant, and stores the results in the database 124.
Using the estimated sound source positions, the statistical processing unit 122 creates a voice distribution map in which the position of each conference participant is mapped onto two-dimensional plane coordinates, as in FIG. 3(a).

FIG. 3(b) shows how the operation determination unit 123 determines the destination of the voice input robot 100.
Based on the voice distribution map shown in FIG. 3(a) and the setting information received by the setting unit 125, the operation determination unit 123 determines the destination of the voice input robot 100 (or of the voice input unit 111; the same applies hereinafter) so that it is equidistant from every conference participant.

FIG. 3(c) is the voice distribution map after the voice input robot 100 has moved. As a result of the move, the voice input robot 100 is equidistant from every conference participant.
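For the three-participant case in FIG. 3, a point equidistant from all participants is the circumcenter of the triangle they form. The following is a minimal sketch under that geometric reading; the function name `equidistant_point` is hypothetical, and the description does not state how the destination is actually computed. With more than three participants an exactly equidistant point generally does not exist, which is consistent with setting (1) asking only for roughly equal distances.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def equidistant_point(a: Point, b: Point, c: Point) -> Point:
    """Return the circumcenter of triangle abc (equidistant from a, b, c)."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        # Collinear participants have no finite equidistant point.
        raise ValueError("participants are collinear")
    a2 = ax * ax + ay * ay
    b2 = bx * bx + by * by
    c2 = cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy)


# Example: participants at three corners of a 4 m square.
target = equidistant_point((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
```

The collinear case is rejected explicitly because a practical controller would need a fallback there, for instance a position that minimizes the spread of the distances.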

FIG. 4 is an example of a voice distribution map created on the basis of the sound source positions and the volume of each sound source. The process of changing the spatial position of the voice input unit 111 is described below with reference to FIG. 4.
Here it is assumed that "(2) The listener wants to hear all speakers' utterances at the same volume" has been input to the setting unit 125 as the setting information.

FIG. 4(a) shows the initial state of the conference participants and the voice input robot 100. In the figure, 1 to 3 indicate the positions of the conference participants, the size of each circle indicates that participant's utterance volume, and the black triangle indicates the initial position of the voice input robot 100.
In the state of FIG. 4(a), the voice input robot 100 is closest to conference participant 1, and accordingly the volume collected from conference participant 1 is the largest.

Within a predetermined sampling time, the voice information processing unit 121 receives the utterances of conference participants 1 to 3 from the voice input unit 111, estimates the sound source position of each conference participant, and stores the results in the database 124. It also calculates each conference participant's utterance volume and stores it in the database 124.
The utterance volume here is a value such as the maximum or minimum volume within the sampling time, or the average volume within the sampling time.
Using the estimated sound source positions, the statistical processing unit 122 creates a voice distribution map in which each conference participant's position and utterance volume are mapped onto two-dimensional plane coordinates, as in FIG. 4(a).
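The utterance volume values named above (maximum, minimum, and average within a sampling time) can be sketched as follows. The function name `utterance_volume` and the representation of a sampling window as a sequence of per-frame volume readings are assumptions for illustration.

```python
from statistics import mean
from typing import Dict, Sequence


def utterance_volume(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize per-frame volume readings from one sampling window."""
    if not samples:
        raise ValueError("sampling window contains no readings")
    return {
        "max": max(samples),
        "min": min(samples),
        "mean": mean(samples),
    }


# Example: three readings taken for one participant within a window.
summary = utterance_volume([50.0, 60.0, 70.0])
```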

FIG. 4(b) shows how the operation determination unit 123 determines the destination of the voice input robot 100.
Based on the voice distribution map shown in FIG. 4(a) and the setting information received by the setting unit 125, the operation determination unit 123 determines the destination of the voice input robot 100 so that the utterance volumes it collects from the conference participants become equal.

FIG. 4(c) is the voice distribution map after the voice input robot 100 has moved. As a result of the move, the utterance volumes (circle sizes) that the voice input robot 100 collects from the conference participants become equal.
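One way to look for a position at which the collected volumes are roughly equal is a grid search over candidate positions. The sketch below assumes a free-field inverse-distance model, in which the received level drops by 20 log10(d) dB at distance d meters from a source whose level is given at 1 m. That propagation model, the grid search, and the names `received_db` and `balanced_position` are illustrative assumptions, not the method of the description.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def received_db(source_db: float, src: Point, mic: Point) -> float:
    """Level at the microphone, assuming inverse-distance attenuation."""
    d = max(math.dist(src, mic), 0.1)  # clamp to avoid log10(0) at the source
    return source_db - 20.0 * math.log10(d)


def balanced_position(
    sources: List[Tuple[Point, float]],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    step: float = 0.1,
) -> Point:
    """Return the grid point with the smallest spread of received levels."""
    nx = int(round((x_range[1] - x_range[0]) / step)) + 1
    ny = int(round((y_range[1] - y_range[0]) / step)) + 1
    best, best_spread = (x_range[0], y_range[0]), math.inf
    for i in range(nx):
        for j in range(ny):
            mic = (x_range[0] + i * step, y_range[0] + j * step)
            levels = [received_db(db, pos, mic) for pos, db in sources]
            spread = max(levels) - min(levels)
            if spread < best_spread:
                best, best_spread = mic, spread
    return best


# Example: a louder speaker at the origin and a quieter one 4 m away.
speakers = [((0.0, 0.0), 66.0), ((4.0, 0.0), 60.0)]
destination = balanced_position(speakers, (0.0, 4.0), (0.0, 4.0))
```

Under this model the chosen point lies nearer the quieter speaker, which matches the intent of setting (2). In a real room, reverberation and microphone directivity would change the attenuation, so the model would need to be replaced or calibrated.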

FIG. 5 is an example of a voice distribution map created on the basis of the sound source positions, the volume of each sound source, and how often each sound source produces voice. The process of changing the spatial position of the voice input unit 111 is described below with reference to FIG. 5.
Here it is assumed that "(3) The listener wants to hear a specific speaker's utterances under conditions that make them easy to hear" has been input to the setting unit 125 as the setting information.

FIG. 5(a) shows the initial state of the conference participants and the voice input robot 100. In the figure, 1 to 3 indicate the positions of the conference participants, the size of each circle indicates that participant's utterance volume, the number of rings in each circle indicates the number of utterances, and the black triangle indicates the initial position of the voice input robot 100.
It is assumed that the listener wants conference participant 3's utterances to be easy to hear.

Within a predetermined sampling time, the voice information processing unit 121 receives the utterances of conference participants 1 to 3 from the voice input unit 111, estimates the sound source position of each participant, and stores the estimates in the database 124. It also calculates each participant's utterance volume and number of utterances and stores them in the database 124.
Using the estimated sound source positions, the statistical processing unit 122 creates a voice distribution map in which each participant's position, utterance volume, and number of utterances are mapped onto two-dimensional plane coordinates, as shown in FIG. 5(a).
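The aggregation described above can be sketched as follows. This is a minimal illustration, assuming that each stored record carries a speaker label, an estimated position, and a volume, and that positions and volumes are summarized by their mean. The record layout, the use of the mean, and the names `Utterance` and `build_distribution_map` are assumptions and do not come from the specification.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: int      # hypothetical label for the estimated sound source
    position: tuple   # estimated (x, y) of the source
    volume: float     # measured volume of this utterance

def build_distribution_map(utterances):
    """Aggregate records into {speaker: mean position, mean volume, utterance count}."""
    grouped = defaultdict(list)
    for u in utterances:
        grouped[u.speaker].append(u)
    result = {}
    for speaker, items in grouped.items():
        n = len(items)
        mean_x = sum(u.position[0] for u in items) / n
        mean_y = sum(u.position[1] for u in items) / n
        mean_volume = sum(u.volume for u in items) / n
        result[speaker] = {
            "position": (mean_x, mean_y),  # where the circle is drawn
            "volume": mean_volume,         # circle size
            "count": n,                    # number of rings
        }
    return result

log = [
    Utterance(1, (0.0, 0.0), 0.75),
    Utterance(1, (0.0, 0.5), 0.25),
    Utterance(3, (2.0, 3.0), 0.5),
]
voice_map = build_distribution_map(log)
```

Each entry corresponds to one circle in FIG. 5(a): the position places the circle, the volume sets its size, and the count sets its number of rings.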

FIG. 5(b) shows how the operation determination unit 123 determines the destination of the voice input robot 100.
Based on the voice distribution map shown in FIG. 5(a) and the setting information received by the setting unit 125, the operation determination unit 123 determines the destination of the voice input robot 100 so that the utterance volume of conference participant 3, as picked up by the voice input robot 100, becomes the largest.

FIG. 5(c) is the voice distribution map after the voice input robot 100 has moved.
Because the spatial position of the voice input robot 100 has changed, the utterance volume (circle size) of conference participant 3, as picked up by the voice input robot 100, becomes the largest, and the utterance volumes of the other participants become smaller.
Note that the number of utterances itself does not change when the voice input robot 100 moves, so the number of rings in each circle does not change.
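One way to picture the destination choice for setting (3) is the sketch below. It assumes the same kind of inverse-distance volume model, a hypothetical minimum approach distance `min_distance`, and a grid of candidate positions, and it selects the candidate at which the target participant's collected volume exceeds the loudest other participant by the widest margin. The specification does not state how the destination is computed, so every name and parameter here is an assumption.

```python
import math

def received_volume(source_pos, source_level, mic_pos, min_distance=0.5):
    """Assumed 1/distance model; min_distance is a hypothetical keep-out radius."""
    return source_level / max(math.dist(source_pos, mic_pos), min_distance)

def best_position_for(target, sources, candidates):
    """Pick the candidate where the target's margin over the loudest other source is largest."""
    def margin(pos):
        levels = {k: received_volume(p, lv, pos) for k, (p, lv) in sources.items()}
        others = [v for k, v in levels.items() if k != target]
        return levels[target] - max(others)
    return max(candidates, key=margin)

# Hypothetical participants 1 to 3: position and level at the source
sources = {1: ((0.0, 0.0), 1.0), 2: ((4.0, 0.0), 1.0), 3: ((2.0, 3.0), 1.0)}
candidates = [(x * 0.5, y * 0.5) for x in range(9) for y in range(7)]
destination = best_position_for(3, sources, candidates)
```

In this toy layout the selected destination coincides with the grid point nearest participant 3, which matches the behavior shown in FIG. 5(c): the circle for participant 3 grows while the others shrink.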

This concludes the examples of voice distribution maps created by the statistical processing unit 122.

Note that the operation determination unit 123 takes into account the sound generated by the voice input robot 100 itself and the change in the sound collection state caused by its movement. It therefore does not issue a movement instruction immediately after determining the destination; instead, it issues the movement instruction to the operation unit 112 when either of the following conditions is satisfied.

(Condition 1) No sound source produces any sound for a certain unit of time.
(Condition 2) The volume produced by each sound source is at or below a certain level.
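The two conditions can be expressed as a simple gate, sketched below. The sketch assumes that voice activity is available as one list of per-source booleans per frame and that current volumes are available as one number per source. The frame count, the level threshold, and all function names are hypothetical values chosen for illustration.

```python
def silent_for(activity_history, required_frames):
    """Condition 1: no source was active in any of the last `required_frames` frames."""
    tail = activity_history[-required_frames:]
    return len(tail) == required_frames and not any(any(frame) for frame in tail)

def all_quiet(current_levels, max_level):
    """Condition 2: every source is at or below `max_level`."""
    return all(level <= max_level for level in current_levels)

def may_start_moving(activity_history, current_levels, required_frames=3, max_level=0.2):
    """Movement is allowed when either condition holds."""
    return silent_for(activity_history, required_frames) or all_quiet(current_levels, max_level)
```

For example, `may_start_moving([[True, False]] * 3, [0.5, 0.1])` is false because one participant is still speaking loudly, whereas the same activity history with levels `[0.1, 0.15]` permits movement under Condition 2.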

Likewise, while the voice input robot 100 is moving, the statistical processing is suspended, again in consideration of the sound generated by the voice input robot 100 itself and the change in the sound collection state caused by its movement.
Specifically, the operation determination unit 123 may instruct the statistical processing unit 122 to suspend processing.

FIG. 6 is an operation flow in which the voice input robot 100 is operated to bring about the voice situation desired by the listener side (the sound collection state of the voice input robot 100), thereby improving the voice environment. A remote conference is assumed here. Each step of FIG. 6 is described below.

(S601)
The following steps are repeated until the exchange of voice through the voice input unit 111 ends. The end of the voice exchange means, for example, the end of the remote conference.
(S602)
The voice input unit 111 acquires the sound of the space in which the voice input robot 100 is located, here the conference room on the speaker side. The acquired sound is sent to the robot control unit 120.

(S603)
Based on the sound received from the voice input unit 111, the voice information processing unit 121 performs computations such as estimating the sound source positions, the volume of each estimated source, and the number of times each estimated source has produced speech. It also transmits the sound received from the voice input unit 111 to the conference terminal 200.
(S604)
The voice information processing unit 121 stores the results of step S603 in the database 124.
(S605)
If the voice input robot 100 is moving, the process proceeds to step S611; otherwise, it proceeds to step S606.

(S606)
The statistical processing unit 122 executes the statistical processing described earlier, based on the data stored in the database 124 and the setting information received by the setting unit 125 (the voice environment desired by the listener side).
(S607)
Based on the result of step S606, the statistical processing unit 122 creates a voice distribution map such as those described with reference to FIGS. 3 to 5. The created map is stored in the database 124 in an arbitrary data format.

(S608)
Based on the voice distribution map created in step S607 and the setting information received by the setting unit 125, the operation determination unit 123 determines whether the position of the voice input robot 100 needs to be changed in order to improve the voice environment as the listener side desires.
If the position needs to be changed, the process proceeds to step S609; if not, it returns to step S602 and the loop continues.
(S609)
Based on the voice distribution map created in step S607 and the setting information received by the setting unit 125, the operation determination unit 123 determines the destination position of the voice input robot 100.

(S610)
The operation determination unit 123 determines whether the voice input robot 100 may start moving. This determination consists of evaluating Conditions 1 and 2 described above.
If movement is permitted, the process proceeds to step S611; if not, it returns to step S602 and the loop continues.
(S611)
The operation determination unit 123 issues an operation command to the operation unit 112. Based on that command, the operation unit 112 drives the voice input robot 100 and changes the spatial position of the voice input unit 111.
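The branching of steps S602 to S611 can be traced with the short sketch below. It models only the control flow of one pass through the loop and takes the three decisions (S605, S608, S610) as boolean inputs; the function name `run_iteration` and this boolean interface are assumptions made for illustration, and the actual processing of each step is omitted.

```python
def run_iteration(robot_is_moving, needs_move, move_permitted):
    """Return the ordered list of step labels visited in one pass of the S601 loop."""
    # S602 to S604 always run: acquire sound, analyze it, store the results.
    steps = ["S602", "S603", "S604", "S605"]
    if robot_is_moving:
        # S605: while moving, statistics are skipped and the move continues.
        steps.append("S611")
        return steps
    # S606 to S608: statistics, map creation, and the need-to-move decision.
    steps += ["S606", "S607", "S608"]
    if not needs_move:
        return steps  # back to S602
    # S609 and S610: choose the destination, then check the start conditions.
    steps += ["S609", "S610"]
    if move_permitted:
        steps.append("S611")
    return steps  # back to S602 if movement was not permitted
```

For instance, `run_iteration(False, True, False)` ends at S610, which corresponds to a destination having been chosen while the participants are still speaking.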

This concludes the flow in which the voice input robot 100 is operated to improve the voice environment.
By operating the voice input robot 100, the sound collection state of the voice input unit 111 changes to the state desired by the listener side.

As described above, according to Embodiment 1, the voices arising from a plurality of sound source positions can be collected as a whole under conditions that match the setting information received by the setting unit 125. Voice communication in which a plurality of people participate, such as a remote conference, can therefore be supported effectively.

Further, according to Embodiment 1, in an environment where voice is exchanged through a voice input means, such as a remote conference, the voice environment can be improved by moving the voice input robot 100 so as to bring about the voice situation desired by the listener side (the sound collection state of the voice input unit 111).

In addition to the benefit to the listener side of obtaining the desired voice environment, Embodiment 1 also offers a benefit to the speaker side.

With conventional remote conference technology, the speaker side receives little feedback about how its utterances sound to the listener side.
For example, there is no way to obtain feedback other than through conversation, such as the listener side saying "I can't hear you well." If the listener side gives no such verbal feedback, the speaker side receives no feedback at all.
Moreover, if the listener side has to give verbal feedback every time, smooth communication is hindered.

To address this problem, in Embodiment 1 the very fact that the voice input robot 100 actually moves within the speaker side's conference space gives the speaker feedback that the listener side wants the sound collection state improved.
By seeing, for example, the voice input robot 100 approach, a speaker can notice that his or her utterances may not be reaching the listener side clearly.

In this regard, it is also conceivable to improve the sound collection state by software processing, such as amplifying the audio signal.
In Embodiment 1, by contrast, the movement of the voice input robot 100 itself both improves the sound collection state and provides feedback to the speaker at the same time.

Embodiment 2.
In Embodiment 1, in view of the sound generated by the voice input robot 100 itself when it moves and the change in the sound collection state caused by its movement, the voice input robot 100 was not permitted to move until a predetermined condition was satisfied.

With this behavior, a time lag arises between the issuing of a movement instruction for the voice input robot 100 and its actual movement. The indirect feedback of the listener side's wishes to the speaker side through the robot's movement is therefore delayed.
If the movement of the voice input robot 100 and the resulting feedback are delayed, the listener side's wishes take correspondingly longer to be reflected, and the listener is forced to put up with utterances that remain hard to hear.

Embodiment 2 therefore aims to eliminate this feedback delay, draw the speaker side's attention, and prompt an improvement in the speaking situation (for example, the speaker changing position or speaking louder).

FIG. 7 is a configuration diagram of the remote conference support system according to Embodiment 2 of the present invention.
In addition to the configuration described with reference to FIG. 1 for Embodiment 1, the remote conference support system according to Embodiment 2 includes a display unit 113 in the robot main body 110. The rest of the configuration is largely the same as in FIG. 1, so the description below focuses on the differences.

The display unit 113 is a functional unit that displays the movement direction and movement position of the voice input robot 100 based on instructions from the operation determination unit 123.
After determining the destination position and direction of the voice input robot 100 based on the statistical processing of the statistical processing unit 122, and before issuing the corresponding instruction to the operation unit 112, the operation determination unit 123 causes the display unit 113 to display that position and direction.

In this way, when a movement instruction for the voice input robot 100 arises, its content is displayed in advance rather than becoming apparent only through the actual movement, so the speaker can learn indirectly how his or her voice is reaching the listener side.
Also, because only a display is performed, the voice environment is not altered as it would be by movement of the voice input robot 100.

In addition, displaying the movement direction and position informs the speaker that the voice input robot 100 is about to move, which has the following effect.
Namely, so that the voice input robot 100 can begin moving, the speaker can pause speaking and leave a gap until the robot has finished moving.

FIG. 8 shows configuration examples of the display unit 113. FIG. 8(a) shows an example in which the display unit 113 is configured using a projector, and FIG. 8(b) shows an example in which it is configured using LEDs (light-emitting diodes).

In the example of FIG. 8(a), the display unit 113, configured using a projector, projects the direction in which the voice input robot 100 is about to move onto the space around the robot, using figures such as arrows, text, and the like.
Specifically, one conceivable approach is to represent the movement direction by the direction of the arrow and the movement distance by its length. Other approaches may be used, as may forms of expression other than arrows and text.
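As a small illustration of the arrow representation, the direction and length of the projected arrow could be derived from the current and destination positions as sketched below. The coordinate convention, the `scale` parameter, and the function name `arrow_for_move` are assumptions introduced for this sketch.

```python
import math

def arrow_for_move(current, destination, scale=1.0):
    """Return (angle in radians, arrow length) for a planned move.

    The angle is measured counterclockwise from the positive x axis, and
    `scale` is a hypothetical factor converting travel distance to arrow length.
    """
    dx = destination[0] - current[0]
    dy = destination[1] - current[1]
    return math.atan2(dy, dx), scale * math.hypot(dx, dy)

angle, length = arrow_for_move((0.0, 0.0), (3.0, 4.0))
```

Here the planned move covers a distance of 5 units, so the arrow length equals 5 when `scale` is 1.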

In the example of FIG. 8(b), a plurality of LEDs are arranged circumferentially around the voice input robot 100, and the movement direction is displayed by lighting the LED that faces the direction in which the robot is about to move.
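Selecting which LED to light can be sketched as follows, assuming the LEDs are evenly spaced, LED 0 faces the direction of angle 0, and indices increase counterclockwise. These conventions and the name `led_index_for_heading` are assumptions; the specification states only that the LED in the movement direction is lit.

```python
import math

def led_index_for_heading(heading_rad, led_count):
    """Index of the evenly spaced LED closest to the given heading."""
    step = 2 * math.pi / led_count           # angular spacing between LEDs
    wrapped = heading_rad % (2 * math.pi)    # normalize to [0, 2*pi)
    return round(wrapped / step) % led_count # wrap the last half-step back to LED 0
```

With eight LEDs, a heading of `math.pi / 2` selects LED 2, and a heading just below a full turn wraps around to LED 0.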

In both FIG. 8(a) and FIG. 8(b), the display is turned off when the voice input robot 100 is not about to move.

FIG. 9 is an operation flow for Embodiment 2 in which the voice input robot 100 is operated to bring about the voice situation desired by the listener side (the sound collection state of the voice input robot 100), thereby improving the voice environment. As with FIG. 6, a remote conference is assumed. Each step of FIG. 9 is described below.

(S901) to (S909)
These steps are the same as steps S601 to S609 in FIG. 6, so their description is omitted.
(S910)
The operation determination unit 123 instructs the display unit 113 to display the movement direction of the voice input robot 100. The display unit 113 displays the movement direction based on that instruction.

(S911)
The operation determination unit 123 determines whether the voice input robot 100 may start moving, by evaluating Conditions 1 and 2 described in Embodiment 1.
If movement is permitted, the process proceeds to step S912; if not, the loop of step S901 continues.
(S912)
This step is the same as step S611 in FIG. 6, so its description is omitted.

This concludes the flow in Embodiment 2 in which the voice input robot 100 is operated to improve the voice environment.

The content displayed by the display unit 113 need not be limited to information about the destination of the voice input robot 100. It suffices that the sound collection state as heard by the listener side can be fed back to the speaker in some form, whether directly or indirectly.
For example, the display may convey the listener side's intention, such as not wanting to interrupt the speaker but having a question about what is currently being said. Such a display helps communication proceed smoothly.

Thus, by providing a means of notifying the speaker of information that at least suggests the sound collection state, the same effect as in Embodiment 2 can be achieved.

As described above, according to Embodiment 2, the display unit 113 displays in advance that the voice input robot 100 is about to move, so how the voice sounds on the listener side can be fed back indirectly to the speaker. This allows the speaker to speak with the listener side in mind.
A speaker who has received this feedback can respond by, for example, adjusting how he or she speaks in light of the direction in which the voice input robot 100 is about to move, or pausing between utterances so that the robot's movement start condition is satisfied.

Embodiment 3.
In Embodiments 1 and 2 above, the voice information processing unit 121 transmits the sound received from the voice input unit 111 to the conference terminal 200 as is.
As needed, the voice information processing unit 121 may instead transmit the sound to the conference terminal 200 after applying processing such as removal of noise on the speaker side or other noise cancelling.

An example of noise on the speaker side is the sound of a PC fan.
When applying noise cancelling, the necessary statistical processing or learning may be performed using the audio data accumulated in the database 124.

Embodiment 4.
In conventional techniques such as those described in Patent Documents 1 to 3, the sound source position is estimated in order to narrow down the direction in which a dialogue robot will act next, and sound sources that fall outside the candidate directions are excluded from subsequent processing.
In Embodiments 1 to 3 above, by contrast, no such selection of sound sources (excluding estimated sound source positions or volumes) is performed.
This is because, unlike the conventional techniques, which assume a one-to-one dialogue between a dialogue robot and a speaker, the present invention aims to collect the voices of a plurality of speakers.
In other words, the present invention has no need to exclude sound source positions from processing, so no selection among sound source positions is made.

However, an audio output means such as a loudspeaker, which outputs on the speaker side the voice uttered by the listener side, is not needed for estimating sound source positions on the speaker side and may therefore be excluded from processing as an exception. This applies to all of the embodiments described above.
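The exception for the loudspeaker can be illustrated with a simple position filter. The sketch assumes that the loudspeaker's position is known in the same coordinates as the estimated sources and that any estimate within a hypothetical `radius` of it is attributed to the loudspeaker. The specification does not describe how the loudspeaker is identified, so this is only one possible approach.

```python
import math

def exclude_loudspeaker(estimated_sources, loudspeaker_pos, radius=0.3):
    """Drop estimated source positions that fall within `radius` of the loudspeaker."""
    return [s for s in estimated_sources if math.dist(s, loudspeaker_pos) > radius]

estimates = [(0.0, 0.0), (4.0, 0.0), (2.1, -0.9)]
participants = exclude_loudspeaker(estimates, loudspeaker_pos=(2.0, -1.0))
```

Only the two estimates away from the loudspeaker remain and would be passed on to the statistical processing.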

Embodiment 5.
FIGS. 3 to 5 showed examples in which the voice distribution map is represented on two-dimensional plane coordinates, but the voice distribution may instead be mapped onto three-dimensional space coordinates. For example, the loudness of the voice or the number of utterances could be represented by height. In the latter case, the rings of each circle act like contour lines to express the height.

Furthermore, the placement of the voice input unit 111 and the movement range of the voice input robot 100 may be extended to three dimensions. The necessary means of movement are provided as appropriate.
In a remote conference, for example, a speaker may take part with a notebook PC open in front of him or her, and the notebook PC then acts as a wall that affects sound collection. Extending the placement of the voice input unit 111 and the movement range of the voice input robot 100 in the height direction as well, as described above, allows more flexible sound collection.

Embodiment 6.
In Embodiments 1 to 5 above, the robot control unit 120 may be provided with a functional unit that estimates the self-position of the voice input robot 100.
For example, in the self-propelled configuration described with reference to FIG. 2(a), the self-position is estimated using values such as the rotation direction of the wheels, the number of rotations, and the wheel diameter.
In the fixed movable configuration described with reference to FIG. 2(b), the self-position is estimated using values such as the arm length and the arm posture (yaw and pitch angles).
With self-position estimation, the voice distribution maps described with reference to FIGS. 3 to 5 become maps in an absolute coordinate system rather than in a relative coordinate system centered on the position of the voice input robot 100. The ideal position of the voice input robot 100 in absolute coordinates is then obtained from quantities such as the maximum and minimum volume and the utterance frequency of the sound sources in absolute coordinates.
This allows the ideal position of the voice input robot 100 to be determined quickly.
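The self-position estimates mentioned above can be sketched with standard dead reckoning and simple trigonometry. The sketch assumes a two-wheel differential drive with a known `track_width` for the self-propelled configuration and a single rigid arm pivoting at the origin for the fixed movable configuration. These mechanical assumptions and all function names are hypothetical; the specification lists only the input quantities.

```python
import math

def wheel_travel(rotations, wheel_diameter):
    """Signed distance rolled by one wheel; negative rotations mean reverse."""
    return rotations * math.pi * wheel_diameter

def update_pose(pose, left_rot, right_rot, wheel_diameter, track_width):
    """Dead-reckoning update of (x, y, heading) for an assumed differential drive."""
    x, y, theta = pose
    dl = wheel_travel(left_rot, wheel_diameter)
    dr = wheel_travel(right_rot, wheel_diameter)
    d = (dl + dr) / 2                 # distance travelled by the robot center
    dtheta = (dr - dl) / track_width  # change in heading
    # Midpoint approximation: advance along the average heading of the step.
    x += d * math.cos(theta + dtheta / 2)
    y += d * math.sin(theta + dtheta / 2)
    return (x, y, theta + dtheta)

def arm_tip_position(arm_length, yaw, pitch):
    """Microphone position at the tip of an assumed single arm pivoting at the origin."""
    horizontal = arm_length * math.cos(pitch)
    return (horizontal * math.cos(yaw), horizontal * math.sin(yaw),
            arm_length * math.sin(pitch))

def to_absolute(pose, relative_point):
    """Convert a source position measured relative to the robot into absolute coordinates."""
    x, y, theta = pose
    rx, ry = relative_point
    return (x + rx * math.cos(theta) - ry * math.sin(theta),
            y + rx * math.sin(theta) + ry * math.cos(theta))
```

Applying `to_absolute` to each estimated source position with the current pose yields a map that stays fixed as the robot moves, which is the property the absolute coordinate system provides.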

Embodiment 7.
For convenience of explanation, Embodiments 1 to 6 above described an example in which the voice input robot 100 is installed on the speaker side and the conference terminal 200 on the listener side.
In two-way communication such as a remote conference, however, both sides speak, so a voice input robot 100 and a conference terminal 200 may be installed at both sites to give each an equivalent environment.

FIG. 1 is a configuration diagram of the remote conference support system according to Embodiment 1.
FIG. 2 shows examples of the external configuration of the voice input robot 100.
FIG. 3 is an example of a voice distribution map created on the basis of the sound source positions only.
FIG. 4 is an example of a voice distribution map created on the basis of the sound source positions and the volume of each sound source.
FIG. 5 is an example of a voice distribution map created on the basis of the sound source positions, the volume of each sound source, and the frequency with which each sound source produces speech.
FIG. 6 is an operation flow in which the voice input robot 100 is operated to bring about the voice situation desired by the listener side, thereby improving the voice environment.
FIG. 7 is a configuration diagram of the remote conference support system according to Embodiment 2.
FIG. 8 shows configuration examples of the display unit 113.
FIG. 9 is an operation flow for Embodiment 2 in which the voice input robot 100 is operated to bring about the voice situation desired by the listener side, thereby improving the voice environment.

Explanation of symbols

100: voice input robot; 110: robot main body; 111: voice input unit; 112: operation unit; 113: display unit; 120: robot control unit; 121: voice information processing unit; 122: statistical processing unit; 123: operation determination unit; 124: database; 125: setting unit.

Claims

A voice input robot comprising:
a voice input unit that accepts voice input;
a sound source position estimation unit that estimates the sound source position of the voice received by the voice input unit; and
an operation unit that changes the position of the voice input unit,
wherein the sound source position estimation unit estimates the sound source positions of a plurality of voices received by the voice input unit, and
the operation unit, based on the estimation result of the sound source position estimation unit, changes the positional relationship between the voice input unit and the sound source positions of the plurality of voices by moving the voice input unit in a direction in which the collected volumes of the voices received by the voice input unit become equal.

The voice input robot according to claim 1, further comprising a storage unit that stores a database that holds estimation results of the sound source position estimation unit in time series order.

The voice input robot according to claim 2, further comprising a statistical processing unit that statistically processes the estimation result held in the database.

The voice input robot according to claim 3, wherein the statistical processing unit maps the result of the statistical processing onto two-dimensional plane coordinates or three-dimensional space coordinates around the voice input robot and stores the mapping result in the storage unit.

The voice input robot according to claim 4, wherein the operation unit changes the positional relationship based on the mapping result.

The voice input robot according to claim 4, further comprising a setting input unit that receives setting information specifying the positional relationship,
wherein the operation unit, based on the mapping result, changes the position of the voice input unit so that the positional relationship becomes the positional relationship specified by the setting information.

The voice input robot according to claim 1, further comprising means for notifying a sound collection state of the voice input unit for each voice.

The voice input robot according to claim 1, further comprising a display unit that displays a sound collection state of the voice input unit for each of the voices.

The voice input robot according to claim 1, further comprising a display unit that displays the direction in which the operation unit will change the position before the change is carried out.

A remote conference support system comprising:
the voice input robot according to any one of claims 1 to 9; and
a terminal having a voice output unit that outputs voice,
wherein the voice input robot and the terminal are connected via a network,
the voice input robot transmits the voice received by the voice input unit to the terminal via the network, and
the terminal receives the voice and outputs it from the voice output unit.

A method for supporting a remote conference, comprising the steps of:
placing, in a conference space, a voice input robot comprising a voice input unit that accepts voice input, a sound source position estimation unit that estimates the sound source position of the voice received by the voice input unit, and an operation unit that changes the position of the voice input unit;
estimating the sound source positions of a plurality of voices received by the voice input unit; and
based on the estimation result of the sound source position estimation unit, changing the positional relationship between the voice input unit and the sound source positions of the plurality of voices by moving the voice input unit in a direction in which the collected volumes of the voices received by the voice input unit become equal.