JP2012108587A

JP2012108587A - Voice communication device and voice communication method

Info

Publication number: JP2012108587A
Application number: JP2010254801A
Authority: JP
Inventors: Nobuhiro Kanbe; 信裕神戸
Original assignee: Panasonic Corp
Current assignee: Panasonic Corp
Priority date: 2010-11-15
Filing date: 2010-11-15
Publication date: 2012-06-07
Also published as: WO2012066734A1

Abstract

PROBLEM TO BE SOLVED: To provide a voice output device capable of realizing a comfortable conversation environment even when a conversation group is fluid.

SOLUTION: A voice communication terminal 100 is a device for controlling voice output of at least one of plural terminals participating in a multi-point voice communication system, and includes a voice arrangement unit 150 for setting a voice source arrangement when voice is output from other terminals, and a co-speaker management unit 140 for detecting an utterer and a co-speaker who is a partner with the utterer in a conversation from the plural terminals and detecting a conversation group on the basis of the combination of the detected utterer and co-speaker. The voice arrangement unit 150 changes the setting of the voice source arrangement according to change in the detected conversation group.

Description

The present invention relates to a voice communication device and a voice communication method for controlling the voice output of a terminal participating in a multipoint voice communication system.

In recent years, communication means have diversified widely, ranging from primarily visual means such as video telephones and e-mail to primarily auditory means such as telephones. For communication in a mobile environment, particularly while on the move (for example, while walking), auditory means are better suited than visual means.

Voice communication takes the form not only of one-on-one conversation but also of so-called multipoint voice communication, such as voice chat and teleconferencing among multiple people. Advances in communication technology have made it possible to transmit high-quality voice to more points, and to receive and output the uttered voices of many people simultaneously. However, when many uttered voices are output at once, it is difficult to tell the speakers apart and distinguish their voices, and therefore difficult to follow the content of the conversation.

Techniques for arranging sound sources in a virtual space are known from, for example, Patent Document 1 and Patent Document 2. In these techniques, an icon for each speaker is moved on a screen that simulates a chat room or the like, in response to operation of a mouse, joystick, or similar device. The sound source corresponding to each speaker is then arranged three-dimensionally based on the position of each icon in the virtual space.

The techniques described in Patent Document 1 and Patent Document 2 control voice output so that each voice is heard in accordance with the direction and distance of its virtual sound source position. The technique described in Patent Document 2 additionally detects who is speaking to whom from the relationship between the sound source arrangement and the direction of the speaker's face, and outputs the uttered voice at a higher volume to the person being addressed. With these conventional techniques, each speaker's voice is heard from a different direction and at a different volume, which makes it easier to tell speakers apart and to follow the content of the conversation.

Patent Document 1: JP 2009-43274 A
Patent Document 2: JP 2001-274912 A

Even when there is a group of terminals whose users are engaged in a conversation on a common topic (hereinafter a "conversation group"), the sound sources may not be arranged together for each conversation group. In such a case, it is hard for the user to tell which conversation group each uttered voice belongs to, and hard to follow the topic. When conversation groups are fixed, this problem does not arise, because conversation normally takes place where the icons or the like are clustered.

As the range of applications of multipoint voice communication widens, however, participants may wish to speak while switching among multiple conversation groups, following the flow of the conversation. In this case, it is desirable for conversation groups to be fluid. What is needed, therefore, is a comfortable conversation environment in which, even when conversation groups are fluid, the user can tell which conversation group each uttered voice belongs to and can easily follow the topic.

An object of the present invention is to provide a voice communication device and a voice communication method capable of realizing a comfortable conversation environment even when conversation groups are fluid.

The voice communication device of the present invention controls the voice output of at least one of a plurality of terminals participating in a multipoint voice communication system, and includes: a voice arrangement unit that sets the sound source arrangement used when voice from other terminals is output; and a co-speaker management unit that detects, from among the plurality of terminals, a speaker and a co-speaker who is the speaker's conversation partner, and detects a conversation group based on the combination of the detected speaker and co-speaker. The voice arrangement unit changes the sound source arrangement setting in accordance with a change in the detected conversation group.

The voice communication method of the present invention controls the voice output of at least one of a plurality of terminals participating in a multipoint voice communication system, and includes the steps of: detecting, from among the plurality of terminals, a speaker and a co-speaker who is the speaker's conversation partner, and detecting a conversation group based on the combination of the detected speaker and co-speaker; and changing, in accordance with a change in the detected conversation group, the sound source arrangement setting used when voice from other terminals is output.

According to the present invention, a comfortable conversation environment can be realized even when conversation groups are fluid.

FIG. 1 is a block diagram showing a configuration example of a voice communication terminal including a voice communication device according to an embodiment of the present invention.
FIG. 2 is a schematic diagram for explaining the concept of direction in the embodiment.
FIG. 3 is a flowchart showing an example of the operation of the voice communication terminal according to the embodiment.
FIG. 4 is a flowchart showing the information transmission process in the embodiment.
FIG. 5 is a diagram showing an example of the structure of transmission data in the embodiment.
FIG. 6 is a flowchart showing the voice control process in the embodiment.
FIG. 7 is a diagram showing an example of a sound source arrangement in the embodiment.
FIG. 8 is a diagram showing an example of arrangement data in the embodiment.
FIG. 9 is a diagram showing an example of how the sound source arrangement is changed in the embodiment.
FIG. 10 is a diagram showing an example of changed arrangement data in the embodiment.
FIG. 11 is a diagram showing another example of changed arrangement data in the embodiment.
FIG. 12 is a diagram showing an example of the sound source arrangement set in each voice communication terminal in the embodiment.

An embodiment of the present invention is described in detail below with reference to the drawings. This embodiment is an example in which the invention is applied to a chat system in which an unspecified number of people can participate and freely form conversation groups.

FIG. 1 is a block diagram showing a configuration example of a voice communication terminal including a voice communication device according to an embodiment of the present invention.

In FIG. 1, the voice communication terminal 100 includes a voice information transmission/reception unit 110, a voice input unit 120, a direction acquisition unit 130, a co-speaker management unit 140, a voice arrangement unit 150, and a voice output unit 160.

The voice information transmission/reception unit 110 has, for example, a network device for connecting to the Internet, and communicates with the voice communication server 300. The voice communication server 300 is a server, located on the Internet for example, that transfers voice data among a plurality of voice communication terminals 100.

In this embodiment, when the voice communication server 300 receives voice data from one voice communication terminal 100, it transfers the received voice data to all the other voice communication terminals 100.

The voice input unit 120 receives an electrical voice signal (hereinafter a "voice signal") containing the user's uttered voice from the voice input device 200, which is connected by wire or wirelessly. The voice input unit 120 converts the received voice signal into digital voice data with an A/D converter, and transmits the voice data to the voice communication server 300 using the voice information transmission/reception unit 110. Hereinafter, the voice data generated by the voice input unit 120 is called "own terminal voice data".

Each time voice data to be transmitted is generated, the voice input unit 120 also notifies the co-speaker management unit 140 of that fact. Whether voice data to be transmitted has been generated can be determined based on, for example, whether a button that the user presses when speaking is being operated, or whether the voltage of the voice signal exceeds a threshold.

In this embodiment, the voice input device 200 is, for example, the microphone of a headset, and converts input voice into a voice signal.

The direction acquisition unit 130 has, for example, a motion sensor. It senses the user's movement and calculates the orientation of the user's face relative to the user's basic posture. Each time it receives a request, for example from the co-speaker management unit 140, the direction acquisition unit 130 outputs the calculated face orientation as direction data to the co-speaker management unit 140 and the voice arrangement unit 150. Direction data is thus information indicating the orientation of the face (for example front, left, or right) relative to the user's basic posture.

Each time the co-speaker management unit 140 is notified by the voice input unit 120 that own terminal voice data has been generated, it requests direction data from the direction acquisition unit 130. From the relationship between the direction data input from the direction acquisition unit 130 and the arrangement data (described below) held by the voice arrangement unit 150, the co-speaker management unit 140 determines the user's conversation partner (hereinafter the "co-speaker") and generates co-speaker information. Specifically, the co-speaker management unit 140 identifies the direction the user is facing while speaking, and determines that the user of the terminal arranged in that direction is the co-speaker.

Arrangement data is a set of positions, one set for each terminal. A position here is a tuple of the terminal ID of another voice communication terminal 100 (hereinafter the "other terminal ID"), the sound source position set for that other terminal ID, and directivity information, which is the direction of that other terminal's conversation. A terminal ID is identification information assigned to each target whose sound source position is to be distinguished; it may be, for example, a user ID, a device ID, or a network ID. The sound source position set for an other terminal ID indicates, for example, front, left, or right. The conversation direction is information indicating which terminal the other terminal is speaking toward, expressed as a direction in the relative positional relationship among the sound sources. The concept of direction in this embodiment is described later.
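As an illustration only, the arrangement data described above could be held in a structure like the following Python sketch. The class and field names are hypothetical, positions are reduced to a single azimuth in degrees, and representing "no conversation direction" as `None` is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Position:
    """One entry of the arrangement data (all names are hypothetical)."""
    other_terminal_id: str                    # terminal the sound source belongs to
    azimuth_deg: float                        # sound source position; 0 = front, +90 = right (assumed)
    facing_terminal_id: Optional[str] = None  # terminal being addressed; None = no direction (assumed)


# Arrangement data: one position per other terminal ID.
ArrangementData = Dict[str, Position]


def facing_target_azimuth(data: ArrangementData, terminal_id: str) -> Optional[float]:
    """Return the azimuth of the terminal that `terminal_id` is addressing, if known."""
    pos = data[terminal_id]
    if pos.facing_terminal_id is None:
        return None
    target = data.get(pos.facing_terminal_id)
    return None if target is None else target.azimuth_deg
```

For example, with terminal "B" at the front addressing terminal "C" on the right, `facing_target_azimuth(data, "B")` gives the azimuth stored for "C".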

Co-speaker information consists of a pair (hereinafter a "conversation pair" where appropriate) of a source ID, which is the terminal ID of the sending voice communication terminal 100, and the terminal ID of the co-speaker. That is, a conversation pair is the pair of the user (voice communication terminal 100) who speaks and the user (voice communication terminal 100) who is spoken to. Hereinafter, the terminal ID of the voice communication terminal 100 itself is called the "own terminal ID", and the other terminal ID of the co-speaker is called the "co-speaker terminal ID". The terminal indicated by the source ID is called the "source", and the terminal indicated by the co-speaker terminal ID is called the "co-speaker terminal".

The co-speaker management unit 140 then attaches the generated co-speaker information to the voice data transmitted by the voice input unit 120, so that it is sent to the voice communication server 300 via the voice information transmission/reception unit 110. That is, the co-speaker management unit 140 transmits the co-speaker information to the other voice communication terminals 100 via the voice communication server 300.

The co-speaker management unit 140 also uses the voice information transmission/reception unit 110 to receive the co-speaker information that other voice communication terminals 100 send, likewise via the voice communication server 300, together with their voice data. The co-speaker management unit 140 holds the co-speaker information it has generated itself and the co-speaker information from the other voice communication terminals 100 as co-speaker data for a fixed period from the time of generation or reception, respectively.
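The retention behavior described above can be pictured with the following minimal Python sketch. The class name, the default retention period, and the use of a monotonic clock are assumptions; the source only states that co-speaker information is held for a fixed period after it is generated or received.

```python
import time
from typing import List, Optional, Tuple

# (source ID, co-speaker terminal ID or None when undetermined)
ConversationPair = Tuple[str, Optional[str]]


class CoSpeakerStore:
    """Keeps conversation pairs for a fixed retention period (hypothetical helper)."""

    def __init__(self, retention_s: float = 10.0) -> None:
        self.retention_s = retention_s
        self._entries: List[Tuple[float, ConversationPair]] = []

    def add(self, pair: ConversationPair, now: Optional[float] = None) -> None:
        """Record a pair with its generation or reception time."""
        t = time.monotonic() if now is None else now
        self._entries.append((t, pair))

    def current(self, now: Optional[float] = None) -> List[ConversationPair]:
        """Drop expired entries and return the pairs still being held."""
        t = time.monotonic() if now is None else now
        self._entries = [(ts, p) for ts, p in self._entries
                         if t - ts <= self.retention_s]
        return [p for _, p in self._entries]
```

Passing `now` explicitly is only a convenience for testing; in normal use the clock is read internally.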

The voice arrangement unit 150 calculates the position and orientation of each sound source based on the co-speaker data held by the co-speaker management unit 140. Specifically, based on the received co-speaker information, the voice arrangement unit 150 determines the arrangement so that the sound sources making up a conversation group are placed together, and calculates for each placed sound source a directivity pointing toward its co-speaker. More specifically, the voice arrangement unit 150 determines the arrangement so that the two members of each conversation pair in the received co-speaker information are positioned close to each other. The voice arrangement unit 150 then generates arrangement data and outputs it to the co-speaker management unit 140 each time it receives a request from the co-speaker management unit 140.
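One way to picture how conversation groups could be derived from conversation pairs is the Python sketch below. It treats the pairs as edges of a graph and takes the connected components as groups, using a small union-find. This grouping rule is an assumption made for illustration; the source states only that groups are detected from the combinations of speaker and co-speaker, and the function name is hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple


def conversation_groups(
    pairs: Iterable[Tuple[str, Optional[str]]]
) -> Set[FrozenSet[str]]:
    """Group terminal IDs that are linked, directly or indirectly, by conversation pairs."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for source_id, co_speaker_id in pairs:
        root_a = find(source_id)
        if co_speaker_id is not None:      # None: co-speaker undetermined
            root_b = find(co_speaker_id)
            if root_a != root_b:
                parent[root_a] = root_b

    groups: Dict[str, Set[str]] = {}
    for terminal_id in list(parent):
        groups.setdefault(find(terminal_id), set()).add(terminal_id)
    return {frozenset(g) for g in groups.values()}
```

Comparing the result of two successive calls would be one simple way to notice that the set of groups has changed and that the sound sources should be rearranged.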

The voice arrangement unit 150 also uses the voice information transmission/reception unit 110 to receive the voice data sent from the voice communication server 300. Hereinafter, the voice data received by the voice arrangement unit 150 is called "other terminal voice data". In accordance with the direction data and the arrangement data, and based on the source ID contained in the co-speaker information attached to the voice data, the voice arrangement unit 150 processes the other terminal voice data so that each sound source is placed three-dimensionally at the position and orientation indicated by the arrangement data. The voice arrangement unit 150 then outputs the processed other terminal voice data to the voice output unit 160.

The voice output unit 160 converts the input other terminal voice data into a voice signal with a D/A converter, and transmits it to the voice output device 400, which is connected by wire or wirelessly.

In this embodiment, the voice output device 400 is, for example, the stereo headphones of a headset, and converts an input voice signal into sound.

FIG. 2 is a schematic diagram for explaining the concept of direction in this embodiment.

The voice arrangement unit 150 places each other terminal ID at "front", "left", and so on relative to the user 510, in a virtual space assumed around the user 510, with the basic posture of the user 510 as the reference. The direction from which an uttered voice is heard changes with the orientation of the face of the user 510 (that is, with which other terminal the user is addressing).

For example, suppose that "front" of the user 510 is set for the other terminal ID of a speaker 520_1. In this case, as described later, the arrangement of the voice output is controlled so that the uttered voice of the speaker 520_1 is heard from the front of the user 510 in the basic posture. If the user 510 then turns his or her face to the left in this state, the arrangement of the voice output is controlled so that the uttered voice of the speaker 520_1 is heard from the right-ear side. This makes it easier to distinguish the voice of the speaker 520_1 located in front from the voices of the other speakers 520 located around the user.
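The relationship between the face orientation and the direction from which a voice is heard can be sketched in Python as follows. The angle convention (azimuth in degrees, 0 for the front of the basic posture, positive values to the right, negative to the left) and the function names are assumptions made for this illustration.

```python
def normalize_azimuth(deg: float) -> float:
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (deg + 180.0) % 360.0 - 180.0


def perceived_azimuth(source_deg: float, face_deg: float) -> float:
    """Azimuth of a sound source relative to the current face orientation.

    Both arguments are measured from the front of the user's basic posture.
    """
    return normalize_azimuth(source_deg - face_deg)
```

With a source placed at the front (0 degrees) and the face turned 90 degrees to the left (-90 degrees), `perceived_azimuth(0.0, -90.0)` evaluates to 90.0, that is, the voice is heard from the right, in line with the example above.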

Furthermore, when a speaker 520_2 is speaking to another speaker 520_3, a voice directivity toward the speaker 520_3 is set for the speaker 520_2. That is, to the user 510, it sounds as though the speaker 520_2 at the front right is speaking toward the speaker 520_3 on the right.

The user 510 also naturally turns his or her head toward the person the user wants to listen to or speak to. The orientation of the face of the user 510 therefore serves as information indicating the direction of the co-speaker.

The face direction and the sound source direction are defined by, for example, an azimuth angle and an elevation angle. Here, the elevation angle is assumed to be 0, and only the azimuth angle is used for the face direction and the sound source direction. This is because directions are generally easier to distinguish left-to-right than front-to-back or up-and-down.

A voice communication terminal 100 configured in this way identifies the co-speaker based on the orientation of its user's face, and acquires conversation pairs based on the co-speaker information received from the other voice communication terminals 100. When a conversation group, that is, the combination of conversation participants, changes, the voice communication terminal 100 detects the change and controls voice output so that each conversation group is heard from one clustered direction. Because the voice communication terminal 100 can thus always keep the sound sources arranged together by conversation group even when conversation groups are fluid, it lets the user follow the content of conversations easily and realizes a comfortable conversation environment.

Next, the operation of the voice communication terminal 100 is described.

FIG. 3 is a flowchart showing an example of the operation of the voice communication terminal 100.

First, in step S1000, the voice input unit 120 determines whether termination of operation has been requested, for example by a user operation on an operation interface (not shown). If termination has not been requested (S1000: NO), the voice input unit 120 proceeds to step S2000.

In step S2000, the voice input unit 120 determines whether a new voice signal has been received from the voice input device 200. The voice input unit 120 determines that a voice signal is being received when, for example, a voice signal at or above a certain voltage is being input, or a voice input switch is on. If a voice signal has been received (S2000: YES), the voice input unit 120 proceeds to step S3000. If no voice signal has been received (S2000: NO), it proceeds to step S4000.

In step S3000, the voice input unit 120 and the co-speaker management unit 140 execute the information transmission process, which transmits the own terminal voice data to the other voice communication terminals 100, and then proceed to step S4000. The information transmission process is described in detail later.

In step S4000, the co-speaker management unit 140 determines whether new other terminal voice data has been received from another voice communication terminal 100. If other terminal voice data has been received (S4000: YES), the co-speaker management unit 140 proceeds to step S5000. If no other terminal voice data has been received (S4000: NO), it returns to step S1000.

In step S5000, the co-speaker management unit 140, the voice arrangement unit 150, and the voice output unit 160 execute the voice control process, which controls voice output based on the received other terminal voice data, and then return to step S1000. The voice control process is described in detail later.

When termination is requested (S1000: YES), the voice input unit 120 ends the series of operations.

Note that the information transmission process and the voice control process may be executed concurrently in separate threads.
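The control flow of FIG. 3 (steps S1000 to S5000) can be summarized by the following Python sketch. It is a single-threaded rendering only, and the five callables are hypothetical stand-ins for the units described above, not part of the disclosed device.

```python
from typing import Callable


def run_terminal(
    stop_requested: Callable[[], bool],        # S1000: has termination been requested?
    voice_signal_received: Callable[[], bool],  # S2000: new voice signal from the input device?
    send_information: Callable[[], None],       # S3000: information transmission process
    other_voice_received: Callable[[], bool],   # S4000: new other terminal voice data?
    control_voice: Callable[[], None],          # S5000: voice control process
) -> None:
    """Repeat the S2000 to S5000 checks until termination is requested."""
    while not stop_requested():
        if voice_signal_received():
            send_information()
        if other_voice_received():
            control_voice()
```

In a threaded variant, the two branches of the loop body would each run in their own loop, which is beyond what this sketch shows.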

FIG. 4 is a flowchart showing the information transmission process (step S3000 in FIG. 3).

In step S3100, the voice input unit 120 converts the voice signal input from the voice input device 200 into own terminal voice data. The voice input unit 120 also notifies the co-speaker management unit 140 that own terminal voice data to be transmitted has been generated.

Then, in step S3200, the co-speaker management unit 140, on receiving the notification, acquires direction data from the direction acquisition unit 130 and arrangement data from the voice arrangement unit 150.

Then, in step S3300, the co-speaker management unit 140 compares the direction data with the arrangement data. That is, the co-speaker management unit 140 compares the direction of the user's face indicated by the direction data with the position (direction) set for each other terminal ID.

Then, in step S3400, the co-speaker management unit 140 determines from the comparison result whether the user is conversing with someone, that is, whether a co-speaker of the user exists. This determination is based on whether the position set for any terminal ID falls within a predetermined angular range around the direction of the user's face indicated by the direction data. If a co-speaker exists (S3400: YES), the co-speaker management unit 140 proceeds to step S3500. If no co-speaker exists (S3400: NO), it proceeds to step S3600.

In step S3500, the co-speaker management unit 140 generates co-speaker information in which the matching other terminal ID is set as the co-speaker terminal ID.

In step S3600, on the other hand, the co-speaker management unit 140 generates co-speaker information in which the co-speaker is undetermined.

Then, in step S3700, the co-speaker management unit 140 transmits the own terminal voice data, with the generated co-speaker information attached, to the voice communication server 300. As a result, the own terminal voice data and the co-speaker information, which indicates the user and the user's current co-speaker, are transmitted to the other voice communication terminals 100.
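Steps S3300 to S3600 can be illustrated with the following Python sketch. The half-width of the angular range, the rule of picking the closest terminal when several fall inside the range, the dictionary layout of the result, and all names are assumptions; the source only says that a terminal whose set position lies within a predetermined angular range of the face direction is treated as the co-speaker.

```python
from typing import Dict, Optional


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def find_co_speaker(
    face_deg: float,
    positions: Dict[str, float],   # other terminal ID -> azimuth in degrees
    half_width_deg: float = 15.0,  # assumed size of the "predetermined angular range"
) -> Optional[str]:
    """Return the terminal ID the user is facing, or None if there is none (S3300, S3400)."""
    candidates = [
        (angular_difference(azimuth, face_deg), terminal_id)
        for terminal_id, azimuth in positions.items()
        if angular_difference(azimuth, face_deg) <= half_width_deg
    ]
    return min(candidates)[1] if candidates else None


def make_co_speaker_info(own_id: str, face_deg: float,
                         positions: Dict[str, float]) -> Dict[str, Optional[str]]:
    """Build co-speaker information; the co-speaker is None when undetermined (S3500, S3600)."""
    return {
        "source_id": own_id,
        "co_speaker_terminal_id": find_co_speaker(face_deg, positions),
    }
```

The resulting information would then be attached to the own terminal voice data before transmission in step S3700.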

FIG. 5 is a diagram showing an example of the structure of the transmission data of the voice communication terminal 100.

As shown in FIG. 5, the transmission data 610 consists of a source address 611 and a destination address 612, each being an IP address or the like, co-speaker information 613, and voice data 614. As described above, the co-speaker information 613 contains a source ID 615 and a co-speaker terminal ID 616.
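The layout of FIG. 5 could be modeled as in the Python sketch below. The class and field names, the field types, and the use of `None` for an undetermined co-speaker are assumptions; the source does not specify an encoding or wire format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CoSpeakerInfo:                       # co-speaker information 613
    source_id: str                         # source ID 615
    co_speaker_terminal_id: Optional[str]  # co-speaker terminal ID 616 (None = undetermined, assumed)


@dataclass(frozen=True)
class TransmissionData:                    # transmission data 610
    source_address: str                    # source address 611 (for example an IP address)
    destination_address: str               # destination address 612
    co_speaker_info: CoSpeakerInfo
    voice_data: bytes                      # voice data 614


def conversation_pair(data: TransmissionData) -> Tuple[str, Optional[str]]:
    """Extract the (speaker, co-speaker) pair carried by one unit of transmission data."""
    info = data.co_speaker_info
    return (info.source_id, info.co_speaker_terminal_id)
```

A receiving terminal would read the conversation pair from each unit of received data and use the source ID to select the sound source position for the attached voice data.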

FIG. 6 is a flowchart showing the voice control process (step S5000 in FIG. 3).

In step S5010, the interlocutor management unit 140 obtains the interlocutor terminal ID and the source ID from the interlocutor information of the received other-terminal voice data, and outputs them to the voice placement unit 150 as interlocutor data.

Then, in step S5020, the voice placement unit 150 determines whether a position has been set for the input interlocutor terminal ID. If no position has been set for the interlocutor terminal ID (S5020: NO), that is, when a new conversation group has appeared, the voice placement unit 150 proceeds to step S5030. If a position has been set for the interlocutor terminal ID (S5020: YES), it proceeds to step S5040.

In step S5030, the voice placement unit 150 places the source ID at a vacant position and proceeds to step S5090. That is, it sets for the source ID a position that is not set for any terminal ID. Because the interlocutor terminal ID is invalid at this point, the voice placement unit 150 changes the interlocutor terminal ID to the source ID so that the voice becomes omnidirectional.

In step S5040, the voice placement unit 150 determines whether a position has already been set for the input source ID. If no position has been set for the source ID (S5040: NO), for example when the source user is speaking to someone for the first time, the voice placement unit 150 proceeds to step S5050. If a position has been set for the source ID (S5040: YES), it proceeds to step S5060.

In step S5050, the voice placement unit 150 places the source ID near the interlocutor terminal ID and proceeds to step S5090, described later. That is, it sets for the source ID a position within a predetermined range of the position of the interlocutor terminal ID.

Meanwhile, in step S5060, the voice placement unit 150 compares the conversation pair for the source ID in the interlocutor data held by the interlocutor management unit 140 with the conversation pair in the interlocutor information received from the other voice communication terminal 100, and determines whether the conversation pair has changed. That is, it determines whether, as a result of the source having changed its conversation partner, the conversation pair in the interlocutor information received from that source differs from the conversation pair in the interlocutor data held by the interlocutor management unit 140. If the conversation pair has not changed (S5060: NO), the voice placement unit 150 proceeds to step S5070. If the conversation pair has changed (S5060: YES), it proceeds to step S5080.

In step S5070, the voice placement unit 150 determines whether the source ID and the interlocutor terminal ID of the conversation pair in the interlocutor data received from the other voice communication terminal 100 are far apart. That is, it determines whether the position currently set for the source ID and the position currently set for the interlocutor terminal ID are, for example, separated by a predetermined distance or more. If the conversation pair is close together (S5070: NO), the voice placement unit 150 proceeds to step S5100. If the conversation pair is far apart (S5070: YES), it proceeds to step S5080.

In step S5080, the voice placement unit 150 rearranges the source and the interlocutor terminal with the source ID brought close to the interlocutor terminal ID, and proceeds to step S5090. That is, it sets positions for the source ID and the interlocutor terminal ID that are close to each other. It also sets the directivity of the voice in the direction from the position of the source ID toward the position of the interlocutor terminal ID.

In step S5090, the voice placement unit 150 outputs the changed arrangement data to the interlocutor management unit 140 and proceeds to step S5110. That is, the voice placement unit 150 updates the arrangement data each time the sound source arrangement setting changes.

In step S5100, on the other hand, the voice placement unit 150 rearranges the source ID and the interlocutor terminal ID at their current positions and proceeds to step S5110. That is, it sets, for the source ID and the interlocutor terminal ID, the positions and directions that are currently set. To avoid rearranging the sources and regenerating arrangement data with identical content, the voice placement unit 150 may hold arrangement data, once generated, for a certain period.

Then, in step S5110, the voice placement unit 150 processes the other-terminal voice data based on the currently set arrangement and outputs the processed voice data to the voice output unit 160. For example, the voice output unit 160 of terminal A processes the other-terminal voice data based on the arrangement data 630 shown in FIG. 8, so that a stereophonic space such as that shown in FIG. 7 is realized at the voice output device 400.

Then, in step S5120, the voice output unit 160 converts the input processed other-terminal voice data into a voice signal, transmits it to the voice output device 400, and ends the voice control process.
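The branching of steps S5020 to S5100 can be summarized in a short sketch. It rests on several assumptions that the description does not fix: positions are azimuths chosen from a fixed list of hypothetical slots, "near" means the closest free slot, the "predetermined distance" is 60 degrees, and a source that already has a position keeps it in step S5030. Directivity and audio rendering (S5110, S5120) are omitted.

```python
from typing import Dict, Optional

SLOTS = [-75.0, -45.0, -15.0, 15.0, 45.0, 75.0]  # hypothetical candidate azimuths (degrees)
FAR = 60.0  # hypothetical "predetermined distance" used in S5070 (degrees)


def vacant_slot(placement: Dict[str, float], near: Optional[float] = None) -> float:
    """Return a free slot; the one closest to `near` when it is given."""
    free = [s for s in SLOTS if s not in placement.values()]
    if not free:
        raise ValueError("no vacant position")
    return free[0] if near is None else min(free, key=lambda s: abs(s - near))


def control_step(placement: Dict[str, float], pairs: Dict[str, str],
                 source_id: str, interlocutor_id: Optional[str]) -> str:
    """Update `placement` and `pairs` in place; return the step that applied."""
    if interlocutor_id not in placement:                  # S5020: NO
        if source_id not in placement:                    # assumption: keep an existing slot
            placement[source_id] = vacant_slot(placement)  # S5030
        pairs[source_id] = source_id                      # partner := source (omnidirectional)
        return "S5030"
    target = placement[interlocutor_id]
    if source_id not in placement:                        # S5040: NO
        placement[source_id] = vacant_slot(placement, near=target)  # S5050
        pairs[source_id] = interlocutor_id
        return "S5050"
    changed = pairs.get(source_id) != interlocutor_id     # S5060
    far = abs(placement[source_id] - target) >= FAR       # S5070
    pairs[source_id] = interlocutor_id
    if changed or far:
        del placement[source_id]                          # free the old slot first
        placement[source_id] = vacant_slot(placement, near=target)  # S5080
        return "S5080"
    return "S5100"                                        # keep the current positions
```

For instance, if terminal F sits at 75 degrees and starts talking to terminal C at -45 degrees, the first call returns `"S5080"` and moves F to the free slot nearest C; a repeated call with the same pair returns `"S5100"`.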

FIG. 7 is a diagram showing an example of the sound source arrangement set in terminal A, one of the voice communication terminals 100. It illustrates a case in which there is a conversation group consisting of terminals A, D, and E and a conversation group consisting of terminals B and C.

The voice placement unit 150, for example, places the sound sources corresponding to the other users, including the interlocutor, in a semicircle at a fixed distance from the center, with the position of the listening user as the center. The left-right balance of the arrangement need not be even, but the voice placement unit 150 places the sound sources so that no conversation group is split. That is, it places the sound sources so that no sound source of a terminal outside a conversation group lies within the range spanned by the sound sources of the other terminals that constitute that conversation group.
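The "no conversation group is split" condition can be checked mechanically. The sketch below assumes, for illustration only, that each sound source is described by a single azimuth and that a group is unsplit when its members are adjacent in left-to-right order; the function name and the example angles are hypothetical.

```python
from typing import Dict, Iterable, List


def groups_are_contiguous(placement: Dict[str, float],
                          groups: Iterable[List[str]]) -> bool:
    """True if, in left-to-right azimuth order, the placed members of every
    conversation group are adjacent, with no outsider between them."""
    order = sorted(placement, key=placement.get)
    for group in groups:
        # The listener's own terminal has no sound source, so skip unplaced IDs.
        indices = [order.index(t) for t in group if t in placement]
        if len(indices) >= 2 and max(indices) - min(indices) + 1 != len(indices):
            return False
    return True


# Hypothetical view from terminal A; F has just joined the group of B and C.
before = {"B": -60.0, "C": -30.0, "D": 0.0, "E": 30.0, "F": 60.0}
after = {"B": -60.0, "C": -30.0, "F": 0.0, "D": 30.0, "E": 60.0}
groups = [["A", "D", "E"], ["B", "C", "F"]]
```

Here `groups_are_contiguous(before, groups)` is false because D and E lie between C and F, while the rearranged placement `after` satisfies the condition.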

FIG. 8 is a diagram showing an example of the arrangement data generated by each voice communication terminal 100 when the sound source arrangement shown in FIG. 7 is set in one of the voice communication terminals 100. Arrangement data is generated individually for each voice communication terminal 100, but the data for all terminals are shown together here. The directivity of each terminal is not shown.

As shown in FIG. 8, each voice communication terminal 100 (identified by its terminal ID) generates, as arrangement data 630, data that describes, for each other-terminal ID 631, an azimuth 632 indicating the direction of the set sound source. In this example, the azimuth takes values from -180 degrees to 180 degrees, with the front as 0 degrees, rotation to the right as positive, and rotation to the left as negative. When an elevation angle is used, it takes values from -90 degrees to 90 degrees, for example, with the horizontal as 0 degrees, upward as positive, and downward as negative.
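For rendering, an azimuth and elevation in this convention can be converted to a unit direction vector. The axis assignment below (x to the listener's right, y to the front, z upward) is an assumption chosen for this sketch; the description defines only the angle conventions.

```python
import math
from typing import Tuple


def direction_vector(azimuth_deg: float,
                     elevation_deg: float = 0.0) -> Tuple[float, float, float]:
    """Unit vector (x, y, z) for a sound source direction.

    Azimuth: 0 = front, positive = right, negative = left.
    Elevation: 0 = horizontal, positive = up, negative = down.
    Axes (assumed): x = right, y = front, z = up.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.sin(az),
            math.cos(el) * math.cos(az),
            math.sin(el))
```

Under these assumptions, an azimuth of 0 degrees points straight ahead along y, 90 degrees points to the right along x, and an elevation of 90 degrees points straight up along z.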

Here, suppose that, in the virtual space of the user of terminal A in the state where the arrangement data of FIG. 8 is in use (the sound source arrangement shown in FIG. 7), the user of terminal F speaks to the user of terminal C, and the users of terminals B, C, and F begin a conversation. Once this conversation starts, terminals A, D, and E form one conversation group (hereinafter the "first conversation group"), and terminals B, C, and F form another (hereinafter the "second conversation group"). With the sound source arrangement of FIG. 7 left unchanged, however, terminal F is far from terminal C, and the first and second conversation groups intersect. The voice communication terminal 100 therefore rearranges the sound sources, for example based on the interlocutor information from terminal F, so that terminal F approaches terminal C, and changes the arrangement data accordingly.

FIG. 9 is a diagram showing an example of how the sound source arrangement is changed, and corresponds to FIG. 7.

First, as shown in FIG. 9(A), the voice placement unit 150 moves terminal F to a position near terminal C. As a result, the positions of terminals B, C, and F are grouped together, and the first and second conversation groups no longer intersect. This makes the voices of the two conversation groups easier to tell apart. Then, as shown in FIG. 9(B), the voice placement unit 150 adjusts the position of each terminal so that terminals B to F are evenly spaced. This makes the individual voices within a conversation group easier to tell apart.
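The second stage, equalizing the spacing once the order of the sources is settled, might look like the following. The angular limits of -60 and 60 degrees and the function name are assumptions for illustration; the figures themselves are not reproduced here.

```python
from typing import Dict, List


def space_evenly(order: List[str],
                 left: float = -60.0, right: float = 60.0) -> Dict[str, float]:
    """Assign evenly spaced azimuths between `left` and `right`,
    preserving the given left-to-right order of terminal IDs."""
    n = len(order)
    if n == 0:
        return {}
    if n == 1:
        return {order[0]: (left + right) / 2.0}
    step = (right - left) / (n - 1)
    return {terminal_id: left + i * step for i, terminal_id in enumerate(order)}


# Order after terminal F has been moved next to terminal C (hypothetical angles).
placement = space_evenly(["B", "C", "F", "D", "E"])
```

Keeping the order fixed while respacing preserves the grouping achieved in the first stage, so the two conversation groups stay separated.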

FIGS. 10 and 11 are diagrams showing examples of the arrangement data when the sound source arrangement is changed as shown in FIG. 9, and correspond to FIG. 8. FIG. 10 is an example of the arrangement data at the stage of FIG. 9(A), and FIG. 11 is an example of the arrangement data at the stage of FIG. 9(B).

As shown in FIGS. 10 and 11, as a result of terminal F joining the second conversation group, the arrangement data is changed in stages according to a predetermined arrangement change rule. Ultimately, the sound source arrangement shown in FIG. 9(B) is realized in the actual voice output. The user of terminal A then hears the voices of the first conversation group and those of the second conversation group from separate, clustered directions, and hears each individual voice from a different direction. The user of terminal A can therefore easily tell whose utterance each one is and to which conversation group it belongs.

In each voice communication terminal 100, a sound source arrangement centered on that voice communication terminal 100 is set.

FIG. 12 is a diagram showing an example of the sound source arrangement set in each voice communication terminal 100. FIGS. 12(A) to 12(F) show, in order, the contents of the arrangement data set in terminals A to F.

As shown in FIG. 12, in each voice communication terminal 100, the sound sources of the other voice communication terminals 100 are virtually arranged around it so as to conform to the predetermined arrangement rule described above.

Through these operations, the voice communication terminal 100 can detect a change in the conversation groups and control the voice output so that each conversation group is heard from its own clustered direction.

If the position of a sound source changes abruptly, the user may feel discomfort or find it hard to tell whose voice is being heard and which conversation group the conversation belongs to.

Therefore, when changing the arrangement, the voice placement unit 150 may output arrangement data that changes in stages so that each sound source moves smoothly. For example, when changing from the state shown in FIG. 7 to the state shown in FIG. 9(A), the voice placement unit 150 may interpolate intermediate positions so that the sound source of terminal F moves by way of the direction of terminal E and then the direction of terminal D.
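One simple way to produce such intermediate positions is linear interpolation of the azimuth. The description does not prescribe an interpolation method, so the function below is only a sketch, and the number of steps and the example angles are hypothetical.

```python
from typing import List


def interpolate_azimuth(start: float, end: float, steps: int) -> List[float]:
    """Return `steps` azimuths that move linearly from `start` to `end`.

    The start position is excluded and the end position is included, so the
    list can be applied one entry per update of the arrangement data.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [start + (end - start) * k / steps for k in range(1, steps + 1)]


# Hypothetical move of terminal F's sound source from 60 degrees to 0 degrees.
path = interpolate_azimuth(60.0, 0.0, 4)
```

Because the sweep stays on the same arc, a source moving across the semicircle passes through the directions of the sources between its start and end positions.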

As described above, the voice communication terminal 100 according to the present embodiment detects conversation groups based on the orientation of the users' faces and changes the sound source arrangement setting in response to changes in the conversation groups. The present embodiment can thereby realize a comfortable conversation environment even when the conversation groups are not specified in advance.

In the present embodiment, the interlocutor is identified based on the orientation of the user's face when the user speaks, but the identification is not limited to this. For example, the voice communication terminal 100 may perform voice recognition on the own-terminal voice data and identify the interlocutor from the name of another user contained in the utterance. In this case, the voice communication terminal 100 needs to store each user's name in association with the corresponding terminal, for example by receiving text data of the user names from the other voice communication terminals 100 in advance and holding it.

In doing so, to reduce the processing load, the voice communication terminal 100 may limit the voice recognition to, for example, only the first few seconds after voice input starts, or only the period during which the user holds down a key switch.
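Assuming a voice recognizer has already produced a text transcript of the opening seconds of an utterance, the name lookup could be as simple as the sketch below. The recognizer itself is out of scope, and the function name, the name table, and the rule of preferring the earliest mentioned name are assumptions for illustration.

```python
from typing import Dict, Optional


def identify_by_name(transcript: str, names: Dict[str, str]) -> Optional[str]:
    """Return the terminal ID whose stored user name appears earliest in the
    recognized text, or None when no stored name is mentioned."""
    text = transcript.casefold()
    hits = [
        (text.find(name.casefold()), terminal_id)
        for terminal_id, name in names.items()
        if name and name.casefold() in text
    ]
    return min(hits)[1] if hits else None


# Hypothetical name table received in advance from the other terminals.
names = {"B": "Sato", "C": "Suzuki", "D": "Tanaka"}
```

With this table, the transcript "Hey Suzuki, did you see that?" resolves to terminal C, and a transcript that mentions no stored name leaves the interlocutor undetermined.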

Alternatively, the voice communication terminal 100 may accept a designation of a sound source direction through a user operation, for example on the cross key of a remote controller, and identify the user of the other terminal set in the designated direction as the interlocutor.

The voice communication terminal 100 may also use both the orientation of the user's face and voice recognition or the like to improve the accuracy of detecting the conversation partner.

In the present embodiment, conversation groups are extracted based on the interlocutor information, but the extraction is not limited to this. For example, the voice communication terminal 100 may extract conversation groups (the user's interlocutor and conversation pairs) based on common keywords contained in the users' utterances. The voice communication terminal 100 may also perform both extraction based on the orientation of the users' faces and extraction based on keywords to improve the accuracy of conversation group extraction.

The interlocutor information need not include the source ID when the source of the interlocutor information can be identified from other information, such as the source address.

The voice communication server 300 may have a function of storing voice data in a database in addition to its function of forwarding voice data. The network to which the present invention is applied may also be a serverless network in which the voice communication terminals 100 connect to and communicate with one another directly.

In the present embodiment, the interlocutor information is transmitted together with the voice data, but the transmission is not limited to this. The voice communication terminal 100 may generate and transmit the interlocutor information at a timing different from the voice input timing or the voice transmission timing. For example, the voice communication terminal 100 may periodically generate and transmit speaker information based on the accumulated time of the user's face orientation.

In such a case, the process of transmitting voice data and the process of generating and transmitting interlocutor information may be executed concurrently in separate threads. Likewise, the process of receiving voice data, the process of receiving interlocutor information, and the process of changing the arrangement may be executed concurrently in separate threads.
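The idea of sending voice data and interlocutor information from separate threads can be illustrated with Python's standard `threading` and `queue` modules. In this sketch a thread-safe queue stands in for the network connection; the function names and message tuples are assumptions, not part of the described system.

```python
import queue
import threading
from typing import Iterable, List, Tuple


def send_voice(frames: Iterable[bytes], outbox: queue.Queue) -> None:
    """Sender thread for voice data."""
    for frame in frames:
        outbox.put(("voice", frame))


def send_interlocutor_info(updates: Iterable[Tuple[str, str]],
                           outbox: queue.Queue) -> None:
    """Sender thread for interlocutor information, independent of voice timing."""
    for info in updates:
        outbox.put(("interlocutor", info))


def run_senders(frames: List[bytes],
                updates: List[Tuple[str, str]]) -> List[tuple]:
    """Run both senders concurrently and return everything they emitted."""
    outbox: queue.Queue = queue.Queue()  # stands in for the network connection
    threads = [
        threading.Thread(target=send_voice, args=(frames, outbox)),
        threading.Thread(target=send_interlocutor_info, args=(updates, outbox)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    messages = []
    while not outbox.empty():
        messages.append(outbox.get())
    return messages
```

The relative order of voice and interlocutor messages is not fixed, which reflects the point that the two kinds of data no longer share a timing; the order within each kind is preserved.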

In the present embodiment, the sound sources are distributed in a semicircle, but the arrangement is not limited to this. For example, the voice communication terminal 100 may distribute the sound sources in the vertical direction or the front-rear direction, or may consolidate the sound source positions of each conversation group.

Even when the sound source positions are consolidated, the content of a conversation can still be followed in an ordinary conversation, in which everyone other than the one current speaker is a listener and the conversation proceeds with the speakers taking turns. Consolidating the sound source positions by conversation group is therefore suitable when there are many speakers or many conversation groups.

Accordingly, the voice communication terminal 100 may consolidate the sound source positions by conversation group when the number of speakers or the number of conversation groups reaches a predetermined threshold, and may defer setting new sound sources when these numbers increase further. Conversely, when the number of speakers or conversation groups decreases after the sound source positions have been consolidated, the voice communication terminal 100 may rearrange the sound sources so that the individual sound sources are distributed again.
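A threshold rule of this kind might be sketched as follows. The threshold values, the mode names, and the choice of the group's mean azimuth as the consolidated position are all hypothetical; the description states only that such thresholds exist.

```python
from typing import Dict, Iterable, List, Tuple


def layout_mode(n_speakers: int, n_groups: int,
                merge: Tuple[int, int] = (8, 3),
                hold: Tuple[int, int] = (12, 5)) -> str:
    """Choose a layout mode from the numbers of speakers and groups.

    `merge` and `hold` are (speaker threshold, group threshold) pairs.
    """
    if n_speakers >= hold[0] or n_groups >= hold[1]:
        return "hold"          # consolidated, and new sound sources are deferred
    if n_speakers >= merge[0] or n_groups >= merge[1]:
        return "consolidated"  # one position per conversation group
    return "distributed"       # individual sound sources spread out


def consolidate(placement: Dict[str, float],
                groups: Iterable[List[str]]) -> Dict[str, float]:
    """Move every member of a group to the group's mean azimuth."""
    result = dict(placement)
    for group in groups:
        members = [t for t in group if t in placement]
        if members:
            center = sum(placement[t] for t in members) / len(members)
            for terminal_id in members:
                result[terminal_id] = center
    return result
```

Because `layout_mode` is evaluated from the current counts, it returns `"distributed"` again once the counts fall below the thresholds, which corresponds to redistributing the individual sound sources.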

In the present embodiment, the present invention is applied to the voice communication terminal 100, which is a user-side device, but its application is not limited to this. The present invention may be applied, for example, to a device that relays voice data among a plurality of terminals (for example, the voice communication server 300 of the present embodiment).

The present invention can be applied not only to the chat system described above, in which an unspecified number of people participate, but also to various multipoint voice communication systems such as teleconference systems.

The voice communication device and the voice communication method according to the present invention are useful as a voice communication device and a voice communication method capable of realizing a comfortable conversation environment even when the conversation groups are fluid.

DESCRIPTION OF SYMBOLS
100 Voice communication terminal
110 Voice information transmission/reception unit
120 Voice input unit
130 Direction acquisition unit
140 Interlocutor management unit
150 Voice placement unit
160 Voice output unit
200 Voice input device
300 Voice communication server
400 Voice output device

Claims

A voice communication device that controls the voice output of at least one of a plurality of terminals participating in a multipoint voice communication system, comprising:
a voice placement unit that sets a sound source arrangement used when voices from other terminals are output; and
an interlocutor management unit that detects, from among the plurality of terminals, a speaker and an interlocutor who is the speaker's conversation partner, and detects a conversation group based on the combination of the detected speaker and interlocutor,
wherein the voice placement unit
changes the setting of the sound source arrangement in accordance with a change in the detected conversation group.

The voice communication device according to claim 1,
wherein the interlocutor management unit
detects the interlocutor based on the face orientation of each of a plurality of users of the plurality of terminals.

The voice communication device according to claim 2,
wherein the voice placement unit
changes the setting of the sound source arrangement so that no sound source of a terminal outside a conversation group lies within the range spanned by the sound sources of the plurality of terminals that constitute that conversation group.

The voice communication device according to claim 3,
wherein the interlocutor management unit
detects the interlocutor, for each terminal, from the relationship between the orientation of the user's face and the sound source arrangement set for that terminal.

The voice communication device according to claim 4, provided in the terminal to be controlled and further comprising:
a voice information transmission/reception unit that communicates with the other terminals;
a voice input unit that acquires voice data including the speech of the user of the terminal and transmits the acquired voice data to the other terminals using the voice information transmission/reception unit;
a direction acquisition unit that acquires the orientation of the user's face; and
a voice output unit that receives, from the other terminals using the voice information transmission/reception unit, voice data including the speech of the users of those terminals, and outputs voice based on the received voice data in accordance with the set sound source arrangement,
wherein the interlocutor management unit
identifies the terminal of the user's interlocutor from the relationship between the acquired orientation of the user's face and the set sound source arrangement, transmits information indicating the identified interlocutor terminal and information indicating the terminal to be controlled to the other terminals as interlocutor information using the voice information transmission/reception unit, receives the interlocutor information transmitted from the other terminals, and detects the conversation group based on the received interlocutor information.

The voice communication device according to claim 1, further comprising:
a voice recognition unit that extracts, by voice recognition processing, the name of the user of the other terminal from the voice data acquired by the voice input unit,
wherein the interlocutor management unit
detects the conversation group from the extracted name and from the relationship between the orientation of the user's face and the sound source arrangement.

A voice communication method for controlling the voice output of at least one of a plurality of terminals participating in a multipoint voice communication system, comprising:
detecting, from among the plurality of terminals, a speaker and an interlocutor who is the speaker's conversation partner, and detecting a conversation group based on the combination of the detected speaker and interlocutor; and
changing the setting of the sound source arrangement used when voices from other terminals are output, in accordance with a change in the detected conversation group.