JP5403137B2 - Network communication system

Info

Publication number: JP5403137B2
Application number: JP2012237429A
Authority: JP
Inventors: 訓志三浦
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2012-10-29
Filing date: 2012-10-29
Publication date: 2014-01-29
Anticipated expiration: 2027-03-30
Also published as: JP2013042542A

Description

The present invention relates to a network communication system in which a plurality of communication devices installed at locations remote from one another are connected over a network and exchange audio and video with each other.

Various systems for holding conferences between remote locations, such as a head office, regional offices, and branch offices, have been disclosed. To give such a conference a greater sense of presence, these systems perform speaker detection, as shown in Patent Document 1 and Patent Document 2, and process the video accordingly, for example by displaying the video of the person currently speaking with emphasis.

Patent Document 1: JP-A-4-339484
Patent Document 2: Japanese Patent No. 2648192

In a conference system of this kind, the video data, audio data, and speaker information of each conference device must be managed. In Patent Document 1, a central control device is therefore provided on the network and controls the communication of all data and information. In Patent Document 2, a transmission right is granted to one of the conference devices connected to the network, and only the device that has obtained transmission permission transmits data and information to the other devices.

However, when a central control device is used, data concentrates on that device, so it must be large in scale and its processing becomes complex. When a transmission right is used, only the single device holding the right can transmit data, so the system lacks flexibility.

Accordingly, an object of the present invention is to provide a network communication system that can acquire and use, in near real time, the speaker information of a plurality of conference devices such as sound emission and collection devices, without requiring a large-scale system.

The present invention is a network communication system comprising a plurality of sound emission and collection devices that are connected to an audio data communication network carrying audio data and that each emit and collect sound, and a communication control server that performs connection control between the plurality of sound emission and collection devices. Each of the sound emission and collection devices includes speaker detection means. The communication control server performs speaker detection information acquisition control on each of the sound emission and collection devices. On accepting the speaker detection information acquisition control, each sound emission and collection device transmits the speaker detection information acquired by its speaker detection means to the communication control server. The audio data is communicated directly between the sound emission and collection devices without passing through the communication control server.

In this configuration, the communication control server does not control the communication of audio data between the sound emission and collection devices; it performs only connection control, such as starting a connection between them. Separately from this connection control, the communication control server performs acquisition control of speaker detection information on each sound emission and collection device. On accepting the speaker detection information acquisition control, each sound emission and collection device acquires speaker position information for its own surroundings, including whether anyone is speaking, and transmits it to the communication control server. By receiving this speaker position information, the communication control server manages the speaker position information of every sound emission and collection device on the network in one place. Because the communication control server does not have to control the communication of audio data, which is large in volume yet must be delivered quickly, it can acquire and manage the speaker position information with a simple configuration and simple processing.

The speaker detection means of the sound emission and collection device of the present invention generates the speaker detection information when the level of the collected sound data input to the device is equal to or higher than a predetermined threshold.

The communication control server of the network communication system of the present invention includes a speaker information management unit that generates speaker status data associating the acquired speaker detection information with the sound emission and collection device that transmitted it.
In this configuration, the communication control server can recognize whether a speaker is present at each sound emission and collection device.

The communication control server of the network communication system of the present invention stores in advance a schedule of communication among the plurality of sound emission and collection devices, and generates the speaker status data in time series based on the schedule or on time information.

In this configuration, the communication control server acquires speaker detection information only while communication is taking place according to the schedule. Speaker detection information is therefore obtained only when it is needed, which, in addition to the system simplification described above, makes effective use of network resources.

The network communication system of the present invention further includes a plurality of video processing means, each corresponding to one of the sound emission and collection devices, that acquire and display video. The communication control server generates speaker status data based on the acquired speaker position information and supplies it to the plurality of video processing means. The video processing means display the speaker status data they receive.

In this configuration, speaker status data based on the set of speaker position information obtained over time is generated and displayed by the video processing means. The speaker status, such as changes in who is speaking, can therefore be grasped visually.

The plurality of video processing means of the network communication system of the present invention switch the video to be displayed based on the speaker status data.

In this configuration, each video processing means can perform display switching control based on the speaker status data, for example highlighting the video of the person currently speaking.

According to the present invention, the speaker information of each communicating sound emission and collection device can be managed and used in one place with a simple system configuration and simple processing, without a large system in which the communication control server manages and uses all of the control data, audio data, and other data together.

FIG. 1 is a block diagram showing the main configuration of the audio communication system of the present embodiment.
FIG. 2 is a block diagram showing the main configuration of the sound emission and collection device 11A.
FIG. 3 illustrates the audio data communication network 201 and the control data communication network 202 when the sound emission and collection devices 11A to 11D are mesh-connected, and when the sound emission and collection devices 11E to 11J, which have the same configuration as the devices 11A to 11D, are cascade-connected.
FIG. 4 is a block diagram showing the main configuration of the control server 10.
FIG. 5 is a diagram showing the configuration of the video data communication network 203 and the speaker status data communication network 204.
FIG. 6 is a flowchart showing the processing flow for establishing the audio data communication network 201.
FIG. 7 is a flowchart showing the processing flow of speaker detection after the audio data communication network 201 has been established.

A network communication system according to an embodiment of the present invention will be described with reference to the drawings. The description takes as its main example an audio communication system that communicates audio between a plurality of sound emission and collection devices, in particular an audio communication system with video that includes video processing devices alongside the sound emission and collection devices. Although the system described includes video processing devices, the processing in those devices is only one example of processing based on speaker detection information, and the video processing devices may be omitted.
FIG. 1 is a block diagram showing the main configuration of the audio communication system of the present embodiment.
FIG. 2 is a block diagram showing the main configuration of the sound emission and collection device 11A.

FIG. 3(A) illustrates the audio data communication network 201 and the control data communication network 202 when the sound emission and collection devices 11A to 11D are mesh-connected, and FIG. 3(B) illustrates the same networks when the sound emission and collection devices 11E to 11J, which have the same configuration as the devices 11A to 11D, are cascade-connected.
FIG. 4 is a block diagram showing the main configuration of the control server 10.
FIG. 5 is a diagram showing the configuration of the video data communication network 203 and the speaker status data communication network 204.

In the audio communication system, the sound emission and collection devices 11A to 11D and the video processing devices 12A to 12D, located in conference rooms 101A to 101D at different sites, are connected to a network 200 through routers 13A to 13D, respectively. The control server 10 is connected to the network 200 through a router 13E. The network 200 consists of one or more physical layers, on which are formed an audio data communication network 201 that carries audio data, a control data communication network 202 that carries control data, and a video data communication network 203 that carries video data.

The conference rooms 101A to 101D are equipped with the sound emission and collection devices 11A to 11D, the video processing devices 12A to 12D, and the routers 13A to 13D, respectively. The devices are connected in the same way in every conference room, so conference room 101A is described as a representative.
Conference room 101A contains the sound emission and collection device 11A, the video processing device 12A, and the router 13A. The sound emission and collection device 11A and the video processing device 12A are connected to the network 200 through the router 13A.

The video processing device 12A includes an imaging unit that captures the area around the sound emission and collection device 11A, a video data generation unit that generates video data from the captured video, a video display unit that displays video data acquired over the video data communication network 203 of the network 200, and an input/output unit that exchanges data with the router 13A. As shown in FIG. 5, the video processing device 12A is also connected to the control server 10 by a speaker status data communication network 204 that carries speaker status data.

The router 13A, together with the other routers 13B to 13E, forms a secure network 200 such as an IPsec VPN. More specifically, as shown in FIG. 3(A), the router 13A forms the audio data communication network 201 connecting the sound emission and collection device 11A with the other sound emission and collection devices 11B to 11D, and forms the control data communication network 202 connecting the sound emission and collection device 11A with the control server 10. The routers 13B to 13D likewise form the control data communication network 202 between their corresponding sound emission and collection devices 11B to 11D and the control server 10. In other words, the control data communication network 202 is formed between the sound emission and collection devices 11A to 11D and the control server 10. As shown in FIG. 5, the router 13A also forms the video data communication network 203 connecting the video processing device 12A with the other video processing devices 12B to 12D.

As shown in FIG. 2, the sound emission and collection device 11A includes a control unit 111, a network I/F 112, a sound emission mixer 113, a speaker detection unit 114, a transmission sound mixer 115, a loudspeaker SP, and a microphone MIC.

The control unit 111 performs overall control of the sound emission and collection device 11A. It also controls the network I/F 112 so that the sound data collected by the device is transmitted to the audio data communication network 201 and the speaker detection data generated by the speaker detection unit 114 is transmitted to the control data communication network 202.

When the control unit 111 receives a speaker detection request from the control server 10, it keeps time with a timer (not shown) and periodically instructs the speaker detection unit 114 to perform speaker detection. If no new speaker detection request arrives within a predetermined time after a request was received, the control unit 111 stops instructing the speaker detection unit 114.
The control unit 111 also controls the processing of the transmission sound mixer 115 according to the selected connection mode. Specifically, it sets the processing according to whether the device is in the mesh connection mode shown in FIG. 3(A) or the cascade connection mode shown in FIG. 3(B).
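The request-driven timing described for the control unit 111 can be sketched as follows. This is only an illustrative model, not the patented implementation: the class name `DetectionScheduler`, the parameters `period_s` and `timeout_s`, and the choice to run the first detection immediately on a fresh request are all assumptions, since the source says only that detection is periodic and stops when no new request arrives within a predetermined time.

```python
class DetectionScheduler:
    """Hypothetical model of the control unit's timer logic.

    Detection runs every `period_s` seconds while speaker detection
    requests keep arriving, and stops once `timeout_s` seconds pass
    without a new request. Both values are assumed parameters.
    """

    def __init__(self, period_s: float, timeout_s: float):
        self.period_s = period_s
        self.timeout_s = timeout_s
        self.last_request = None  # time of the most recent request
        self.next_run = None      # time of the next scheduled detection

    def is_active(self, now: float) -> bool:
        # Active only if a request was seen less than timeout_s ago.
        return (self.last_request is not None
                and now - self.last_request < self.timeout_s)

    def on_request(self, now: float) -> None:
        # A request on an idle scheduler restarts detection right away
        # (an assumption); otherwise it only extends the active window.
        was_active = self.is_active(now)
        self.last_request = now
        if not was_active:
            self.next_run = now

    def poll(self, now: float) -> bool:
        """Return True if speaker detection should run at time `now`."""
        if not self.is_active(now):
            return False
        if now >= self.next_run:
            self.next_run = now + self.period_s
            return True
        return False
```

Passing the current time in explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to check by hand.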

The network I/F 112 connects to the external network 200 through the router 13A. It converts the collected sound data of the sound emission and collection devices 11B to 11D, received over the audio data communication network 201, from the communication data format of that network into a predetermined audio data format and supplies it to the sound emission mixer 113 and the transmission sound mixer 115. It converts the sound data collected by the microphone MIC, and mixed as necessary by the transmission sound mixer 115, into the communication data format of the audio data communication network 201 and transmits it over that network to the sound emission and collection devices 11B to 11D. The network I/F 112 also transmits the speaker detection data generated by the speaker detection unit 114 to the control server 10 over the control data communication network 202.

The sound emission mixer 113 mixes the audio data collected by the other sound emission and collection devices 11B to 11D and outputs the result to the loudspeaker SP.
The loudspeaker SP emits sound according to the input audio data.
The microphone MIC collects the sound around the device and supplies the collected sound data to the speaker detection unit 114 and the transmission sound mixer 115.
The speaker detection unit 114 has a speaker detection threshold set and stored in advance. If the level of the input collected sound data is equal to or higher than the speaker detection threshold, the unit judges that a person seated near the device is speaking and generates speaker detection data indicating "speech present". If the level is below the speaker detection threshold, the unit judges that no one is speaking and generates speaker detection data indicating "no speech".
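The threshold comparison performed by the speaker detection unit 114 can be illustrated with a short sketch. The source does not say how the "level" of the collected sound data is measured, so the use of the RMS amplitude of one frame is an assumption, as are the function names and the dictionary layout of the result.

```python
import math


def rms_level(samples):
    """Root-mean-square amplitude of one frame (assumed level measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_speaker(samples, threshold):
    """Return hypothetical speaker detection data for one frame.

    "Speech present" is reported when the level is equal to or higher
    than the threshold, and "no speech" otherwise, as in the text.
    """
    return {"speech": rms_level(samples) >= threshold}
```

For example, `detect_speaker([0.5, -0.5, 0.5, -0.5], 0.3)` reports speech because the frame's RMS level is 0.5, while a frame of very small samples does not.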

Under the control of the control unit 111, the transmission sound mixer 115 outputs the collected sound data unchanged in the mesh connection mode. In the cascade connection mode shown in FIG. 3(B), it mixes the audio data collected by other sound emission and collection devices into the collected sound data of its own device and outputs the result. In this case the transmission sound mixer 115 holds data designating the destinations and generates audio data for transmission for each destination. Specifically, it mixes the audio data collected by the sound emission and collection devices other than the destination with the collected sound data of its own device, and outputs the result in association with the destination information. This processing is performed for each destination.
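The per-destination mixing in the two connection modes can be sketched as below. This is a simplified model under stated assumptions: frames are equal-length lists of samples, mixing is a plain sample-wise sum without gain control or clipping, and the names `mix`, `build_outgoing`, and the mode strings are hypothetical.

```python
def mix(frames):
    """Sample-wise sum of equal-length frames (no clipping, for brevity)."""
    return [sum(values) for values in zip(*frames)]


def build_outgoing(mode, own_frame, received, destinations):
    """Build one outgoing frame per destination.

    mode         -- "mesh" or "cascade" (assumed labels)
    own_frame    -- samples collected by this device's microphone
    received     -- dict: source device id -> frame received from it
    destinations -- device ids this device transmits to
    """
    if mode == "mesh":
        # Mesh: every destination gets this device's own sound unchanged.
        return {dest: list(own_frame) for dest in destinations}
    if mode == "cascade":
        outgoing = {}
        for dest in destinations:
            # Cascade: own sound plus everything received from devices
            # other than the destination, so a destination never gets
            # its own audio back.
            others = [f for src, f in received.items() if src != dest]
            outgoing[dest] = mix([own_frame] + others)
        return outgoing
    raise ValueError(f"unknown connection mode: {mode}")
```

Excluding the destination's own audio from the mix is what prevents a device in the cascade from hearing itself echoed back.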

With this configuration, the sound emission and collection device 11A (and likewise 11B to 11D) transmits and receives audio data directly to and from the other sound emission and collection devices over the audio data communication network 201. The device also acquires speaker detection data in response to a speaker detection request from the control server 10 and transmits it to the control server 10.

As shown in FIG. 4, the control server 10 includes a communication control unit 150, a schedule control unit 151, a speaker information management unit 152, a network I/F 153, and a display unit 154.
The communication control unit 150 controls the communication with the sound emission and collection devices 11A to 11D that takes place through the network I/F 153 over the control data communication network 202. Specifically, it does not control the transmission and reception of audio data; instead, it uses the control data communication network 202 to control the establishment and release of the audio data communication network 201 between the sound emission and collection devices 11A to 11D. It also issues requests for speaker detection data to each of the sound emission and collection devices 11A to 11D and controls the reception of that data. In addition, it controls the communication of speaker status data over the speaker status data communication network 204.

The schedule control unit 151 has a timer function and stores schedules of conferences and the like that have been entered in advance. When it detects the start time of a conference, it notifies the communication control unit 150 to establish the audio data communication network 201. When it detects the end time of a conference, it notifies the communication control unit 150 to release the audio data communication network 201. The schedule control unit 151 also supplies the speaker information management unit 152 with the stored schedule, an indication of whether a conference is currently in progress, and time information.
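The start and end detection performed by the schedule control unit 151 might be modeled as follows. The data layout (`Conference` with numeric `start` and `end` times and a tuple of device ids), the `tick` method, and the event labels "establish" and "release" are assumptions made for illustration; the source specifies only that the unit notifies the communication control unit at the start and end of a scheduled conference.

```python
from dataclasses import dataclass


@dataclass
class Conference:
    start: float    # scheduled start time (assumed: seconds)
    end: float      # scheduled end time
    devices: tuple  # ids of participating devices


class ScheduleController:
    """Hypothetical model of the schedule control unit's timer logic."""

    def __init__(self, conferences):
        self.conferences = list(conferences)
        self.active = set()  # indices of conferences now in progress

    def tick(self, now):
        """Return the notifications due at time `now`."""
        events = []
        for index, conf in enumerate(self.conferences):
            running = conf.start <= now < conf.end
            if running and index not in self.active:
                self.active.add(index)
                events.append(("establish", conf.devices))
            elif not running and index in self.active:
                self.active.remove(index)
                events.append(("release", conf.devices))
        return events

    def in_progress(self):
        """True while at least one conference is running."""
        return bool(self.active)
```

Calling `tick` repeatedly with the current time yields one "establish" event when a conference begins and one "release" event when it ends, and nothing in between.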

The speaker information management unit 152 acquires the speaker detection data of each of the sound emission and collection devices 11A to 11D through the communication control unit 150, links it with the schedule and time information from the schedule control unit 151, and generates speaker status data containing speaker information in time series. In this data, each item of speaker information is associated with the sound emission and collection device it came from. The generated speaker status data is supplied to the display unit 154 and stored. When an external device such as the video processing device 12A requests it through the communication control unit 150, the data is transmitted to the requester via the communication control unit 150, the network I/F 153, and the speaker status data communication network 204.
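One way to picture the time-series speaker status data is as a list of timestamped records, each tied to the device that reported it. The sketch below is an assumption about the data shape, not the format used in the patent; the class and method names are hypothetical.

```python
class SpeakerInfoManager:
    """Hypothetical store for time-series speaker status data."""

    def __init__(self):
        # Each record is (time, device_id, speech_present).
        self.records = []

    def add(self, time, device_id, speech_present):
        """Record one speaker detection result from one device."""
        self.records.append((time, device_id, bool(speech_present)))

    def status_at(self, time):
        """Latest known state of each device at or before `time`."""
        latest = {}
        for t, device_id, speech in sorted(self.records):
            if t <= time:
                latest[device_id] = speech
        return latest

    def active_speakers(self, time):
        """Sorted ids of devices reporting speech at `time`."""
        return sorted(d for d, s in self.status_at(time).items() if s)
```

Keeping every record, rather than only the latest state, is what allows a display to show how the speaker changed over the course of a conference.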

The display unit 154 consists of a liquid crystal panel or the like. Based on the schedule, time information, and speaker status data supplied by the speaker information management unit 152, it presents a graphical display from which changes of speaker can be grasped visually. A person at the location of the control server 10 can thus check the speaker status easily and visually.

Because the speaker status data can be output outside the control server 10 as described above, it can easily be used in each of the conference rooms 101A to 101D. For example, when the video processing device 12A uses the speaker status data and obtains "speech present" information, it can display the speaker image from the video processing device that corresponds to the sound emission and collection device associated with that information with more emphasis than the images from the other video processing devices. This makes it easier to tell who is speaking during the conference.
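The highlighting decision in a video processing device can be sketched as a simple lookup. The mapping from sound emission and collection devices to video sources, the function name `choose_layout`, and the labels "highlight" and "normal" are assumptions for illustration only.

```python
def choose_layout(status, device_to_video):
    """Decide how to render each remote video source.

    status          -- dict: sound device id -> True if "speech present"
    device_to_video -- dict: sound device id -> id of the video
                       processing device in the same room (assumed)
    Returns a dict: video id -> "highlight" or "normal".
    """
    return {
        video: "highlight" if status.get(device, False) else "normal"
        for device, video in device_to_video.items()
    }
```

A device missing from `status` is treated as silent, so its video is shown normally until speaker status data for it arrives.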

Next, the processing flows for establishing the audio data communication network 201 and for speaker detection in the network communication system of the present embodiment are described in more detail with reference to FIG. 6 and FIG. 7.
FIG. 6 is a flowchart showing the processing flow for establishing the audio data communication network 201.
FIG. 7 is a flowchart showing the processing flow of speaker detection after the audio data communication network 201 has been established.

The following description covers the case in which the sound emission and collection device 11A connects to the sound emission and collection device 11B; the same processing applies to the other sound emission and collection devices. The transmission and reception of control data over the control data communication network 202 described below is implemented, specifically, with SIP.

First, the control server 10 reads the schedule that has been set and stored in advance (S101) and acquires the current time (S102). When the acquired time reaches the conference start time in the schedule (S103: Y), the control server 10 transmits, over the control data communication network 202, a request to one of the sound emission and collection devices participating in the conference, device 11A, to place a call to another participating device 11B (S104).

The sound emission and collection device 11A has been placed in a standby state, either by a user switching it on or by automatic power-on from a timer (S201). When it detects the outgoing-call request from the control server 10, it receives the request (S202). In accordance with the request, the sound emission and collection device 11A transmits a call addressed to the sound emission and collection device 11B to the control server 10 over the control data communication network 202 (S203), and the control server 10 relays the call to the sound emission and collection device 11B over the control data communication network 202 (S105).

Like the sound emission and collection device 11A, the sound emission and collection device 11B is already in a standby state (S301), and it accepts the call from the sound emission and collection device 11A relayed by the control server 10 (S302). At this point the control server 10 has also transmitted an incoming-call request over the control data communication network 202 (S106), which the sound emission and collection device 11B receives after accepting the call (S303).

On confirming that it is not busy, the sound emission and collection device 11B transmits communication-start-ready information for the voice data communication network 201 to the control server 10 via the control data communication network 202 (S304), and the control server 10 relays this information to the sound emission and collection device 11A (S107).

On receiving the communication-start-ready information (S204), the sound emission and collection device 11A transmits communication start control for the voice data communication network 201 to the sound emission and collection device 11B, using the control data communication network 202 and passing through the control server 10 (S205 → S108 → S305). The sound emission and collection device 11B responds to this communication start control over the same route, using the control data communication network 202 and passing through the control server 10 (S305 → S108 → S205).

When this exchange of communication start control has been confirmed, a transmission path over the voice data communication network 201 is established between the sound emission and collection device 11A and the sound emission and collection device 11B (S206, S306).

Once the transmission path over the voice data communication network 201 is established, the sound emission and collection devices 11A and 11B transmit and receive voice data without going through the control server 10 (S207, S307).

Thus, in the configuration of this embodiment, the control server 10 takes the lead up to the establishment of the voice data communication network 201 over which the sound emission and collection devices exchange voice data; once that network is established, the sound emission and collection devices transmit and receive voice data directly with each other. As a result, the control server 10 need not be a large-scale, complex server with a mixing function.
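The setup sequence from S104 to S307 can be illustrated with a small in-memory model. The sketch below is only an assumption-laden illustration: the class names, the message labels (`"CALL"`, `"READY"`, `"START"`, `"START_ACK"`), and the `voice_net` dictionary standing in for the voice data communication network 201 are all hypothetical, and the separate incoming-call request of S106 is omitted for brevity. What it shows is the division of roles: the server relays signalling only, and voice frames go directly from device to device.

```python
class ControlServer:
    """Relays signalling only; it never handles voice data."""

    def __init__(self):
        self.devices = {}
        self.relayed = []   # log of signalling messages relayed

    def register(self, device):
        self.devices[device.name] = device
        device.server = self

    def request_call(self, caller, callee):
        """S104: ask `caller` to place a call to `callee`."""
        self.devices[caller].on_call_request(callee)

    def relay(self, sender, receiver, message):
        """S105, S107, S108: forward one message on the control network."""
        self.relayed.append((sender, receiver, message))
        self.devices[receiver].on_signal(sender, message)


class Device:
    def __init__(self, name, busy=False):
        self.name = name
        self.busy = busy
        self.server = None
        self.peer = None            # set once the voice path is established
        self.received_audio = []

    def on_call_request(self, callee):
        self.server.relay(self.name, callee, "CALL")            # S203

    def on_signal(self, sender, message):
        if message == "CALL" and not self.busy:                 # S302
            self.server.relay(self.name, sender, "READY")       # S304
        elif message == "READY":                                # S204
            self.server.relay(self.name, sender, "START")       # S205
        elif message == "START":                                # S305
            self.peer = sender                                  # S306
            self.server.relay(self.name, sender, "START_ACK")
        elif message == "START_ACK":
            self.peer = sender                                  # S206

    def send_audio(self, frame, voice_net):
        """S207, S307: voice data goes straight to the peer, not via the server."""
        voice_net[self.peer].received_audio.append(frame)
```

In this model, calling `request_call("11A", "11B")` on a server with both devices registered produces four relayed signalling messages, after which `send_audio` delivers frames to the peer without adding anything to the server's log.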

Next, when the voice data communication network 201 has been established, the control server 10 starts timekeeping (S111). When the control server 10 detects a preset speaker detection request transmission timing (S112: Y), it transmits a speaker detection request via the control data communication network 202 to each of the sound emission and collection devices 11A and 11B with which a connection has been established (S113). If it is not yet the speaker detection request transmission timing (S112: N), the control server 10 continues timekeeping and timing detection (S111).
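The timing loop of S111 to S113 might look like the following sketch. The text only speaks of a "preset" transmission timing, so the fixed interval used here is an assumption, as are the function name and its parameters; the loop is simulated with an integer clock rather than real time so that it can be run directly.

```python
def detection_request_times(interval_s, duration_s, tick_s=1):
    """Simulate S111 to S113 with a hypothetical fixed request interval.

    The clock advances in steps of `tick_s` seconds (S111). Whenever
    `interval_s` seconds have passed since the last request (S112: Y),
    the current time is recorded as a request transmission (S113);
    otherwise timekeeping simply continues (S112: N).
    """
    sent = []
    last_sent = 0
    elapsed = 0
    while elapsed < duration_s:
        elapsed += tick_s                       # S111: keep time
        if elapsed - last_sent >= interval_s:   # S112: request timing reached?
            sent.append(elapsed)                # S113: send to each connected device
            last_sent = elapsed
    return sent
```

With a 5-second interval over 12 seconds, requests are recorded at 5 and 10 seconds.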

Since the sound emission and collection devices 11A and 11B perform the same processing from here on, only the sound emission and collection device 11A is described below.
On accepting the speaker detection request (S211), the sound emission and collection device 11A transmits confirmation data, indicating that the request has been accepted, to the control server 10 via the control data communication network 202 (S212). On obtaining this confirmation data, the control server 10 determines that the sound emission and collection device 11A will start speaker detection (S114) and enters a standby state for receiving speaker detection data.

The sound emission and collection device 11A starts timekeeping (S213) and, when the detection timing arrives (S214: Y), performs speaker detection and generates speaker detection data as described above, then transmits the speaker detection data to the control server 10 (S215). The sound emission and collection device 11A uses the control data communication network 202 for this transmission.
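Claim 2 of this publication ties speaker detection to the level of the collected sound reaching a predetermined threshold. A minimal sketch of that check is given below. Treating "level" as the RMS amplitude of one block of samples is an assumption, and the function name and the shape of the returned record are hypothetical.

```python
import math


def detect_speaker(device_id, samples, threshold):
    """Return speaker detection data if the collected sound is loud enough.

    Assumption: "level" is taken to be the RMS amplitude of one block of
    samples; the source text only says the level is compared with a
    predetermined threshold.
    """
    if not samples:
        return None
    level = math.sqrt(sum(s * s for s in samples) / len(samples))
    if level >= threshold:
        return {"device": device_id, "level": level}
    return None
```

A block that stays below the threshold yields `None`, in which case the device has nothing to report for that detection timing.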

The sound emission and collection device 11A continues speaker detection and the generation and transmission of speaker detection data until the detection period ends (S216: N → S213). The end of the detection period is determined, for example, as follows. (1) When the sound emission and collection device 11A is performing speaker detection in response to a speaker detection request from the control server 10, it decides to end if no new speaker detection request is accepted within a preset time. (2) Although not shown in the figure, the sound emission and collection device 11A decides to end when it receives, from the control server 10, speaker detection request end data indicating the end of the speaker detection request. On detecting the end of the detection period, the sound emission and collection device 11A stops acquiring speaker detection data and transmitting it to the control server 10 (S216: Y → S217).
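The two end conditions for the detection period can be captured in a small state holder. The sketch below is illustrative only; the class name, method names, and the use of plain numbers for time are assumptions.

```python
class DetectionSession:
    """Tracks whether a device should keep reporting speaker detection data."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # hypothetical "preset time" of condition (1)
        self.last_request = None     # time of the most recent detection request
        self.ended = False           # set by condition (2)

    def on_request(self, now):
        """A speaker detection request arrived (S211); (re)start the period."""
        self.last_request = now
        self.ended = False

    def on_end_data(self):
        """Condition (2): the server sent speaker detection request end data."""
        self.ended = True

    def is_active(self, now):
        """S216: False once either end condition holds."""
        if self.last_request is None or self.ended:
            return False
        return now - self.last_request < self.timeout_s
```

Under this model the server keeps a device reporting simply by repeating its request before the timeout expires, and can stop it at once by sending the end data.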

The control server 10 receives and stores the speaker detection data from the sound emission and collection device 11A (S115) and generates or updates the speaker status data (S116). That is, when the control server 10 acquires speaker detection data for the first time, it generates new speaker status data; as it acquires further speaker detection data, it updates the immediately preceding speaker status data by appending the new speaker detection data. This generation and updating of the speaker status data continues until the conference end timing set in the schedule or an end timing entered by a user operation (S117 → S111). When the control server 10 detects the end time, it stores the speaker status data and performs end processing (S117: N → S118).
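The generate-then-append behaviour of S115 and S116 can be sketched as a time-ordered list of records, each associating a report with the device that sent it. The class name, the record fields, and the `speakers_at` helper are hypothetical; the publication does not specify a data format.

```python
class SpeakerStatus:
    """Hypothetical server-side store for speaker status data (S115, S116)."""

    def __init__(self):
        self.records = None   # no status data exists until the first report

    def update(self, timestamp, device_id, speaking):
        entry = {"time": timestamp, "device": device_id, "speaking": speaking}
        if self.records is None:
            self.records = [entry]       # first report: generate new status data
        else:
            self.records.append(entry)   # later reports: append to existing data
        return self.records

    def speakers_at(self, timestamp):
        """Devices that reported speech at exactly `timestamp`."""
        if self.records is None:
            return []
        return [r["device"] for r in self.records
                if r["time"] == timestamp and r["speaking"]]
```

Keeping the device identifier in every record is what lets a display or video-switching function later ask which room was speaking at a given time.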

Through this processing, the control server 10 can acquire the speaker detection data, that is, the speaker information, of each of the sound emission and collection devices 11A to 11D without performing communication control of the voice data. The various applications based on speaker detection data described above can therefore be realized without a large-scale, expensive server. Moreover, because the speaker detection data and the speaker status data are not transmitted or received over the voice data communication network 201, their exchange neither affects, nor is affected by, the transmission and reception of voice data on that network. In other words, the speaker detection data and the speaker status data can be exchanged independently. Since the amount of such data is relatively small, even a transmission path of modest capacity allows the voice data and the speaker status data to be used in each of the conference rooms 101A to 101D without a time lag relative to the voice data.

101A to 101D - conference rooms, 10 - control server, 11A to 11D - sound emission and collection devices, 12A to 12D - video processing devices, 13A to 13E - routers, 200 - network, 201 - voice data communication network, 202 - control data communication network, 203 - video data communication network, 204 - speaker status data communication network,
111 - control unit of the sound emission and collection device, 112 - network I/F of the sound emission and collection device, 113 - mixer for sound emission, 114 - speaker detection unit, 115 - mixer for transmission sound,
150 - communication control unit of the control server 10, 151 - schedule control unit, 152 - speaker information management unit, 153 - network I/F of the control server 10, 154 - display unit

Claims

1. A network communication system comprising:
a plurality of sound emission and collection devices, each connected to a voice data communication network for communicating voice data and each emitting and collecting sound; and
a communication control server that controls connection between the plurality of sound emission and collection devices,
wherein each of the plurality of sound emission and collection devices includes speaker detection means,
the communication control server performs speaker detection information acquisition control on each of the plurality of sound emission and collection devices,
each of the plurality of sound emission and collection devices, on receiving the speaker detection information acquisition control, transmits the speaker detection information acquired by the speaker detection means to the communication control server, and
the voice data is communicated directly between the plurality of sound emission and collection devices without going through the communication control server.

2. The network communication system according to claim 1, wherein the speaker detection means generates the speaker detection information when the level of sound collection data input to the sound emission and collection device is equal to or higher than a predetermined threshold.

3. The network communication system according to claim 1 or 2, wherein the communication control server includes a speaker information management unit that generates speaker status data in which the acquired speaker detection information is associated with the sound emission and collection device that transmitted the speaker detection information.

4. The network communication system according to claim 3, wherein the communication control server stores in advance a schedule of communication by the plurality of sound emission and collection devices, and generates the speaker status data in time series based on the schedule or on time information.

5. The network communication system according to claim 3 or 4, further comprising a plurality of video processing means, provided in correspondence with the plurality of sound emission and collection devices, each acquiring and displaying video,
wherein the plurality of video processing means display the acquired speaker status data.

6. The network communication system according to claim 5, wherein the plurality of video processing means perform switching control of the video to be displayed based on the speaker status data.